5 Qualitative Data – Econometrics with R

5.1 Topics of this chapter

Interaction terms
Dummy variables
- Binary dummies
- Interaction with dummy variables
- Treatment effects and the Difference in Difference Estimator
- Chow test
- Dummies for multiple characteristics - categorical data
Binary variables as dependent variable – the linear probability model (LPM )
Prediction errors
- Variance and confidence intervals for predictions
- Studentized residuals

5.2 Interaction terms

In Section 2.9.4, we discussed a model, where unobservable factors affect the slope parameters \boldsymbol \beta – this was the random coefficient model, which poses no real additional problems to estimation (besides heteroscedasticity).

In this section we discuss the influence of the level of a observably variable, x_2, on the partial effect (the slope) of x_1 on y. This is modeled with interaction terms, which are the product of the two (or more) regressors:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u \tag{5.1}

So, if you want the partial effect of x_1 on y we have

\dfrac {\partial y}{\partial x_1} = \beta_1 + \beta_3 x_2 \tag{5.2}

Clearly, this partial effect depends on the level of x_2. Note that \beta_1 is the the partial effect of x_1 only if x_2 = 0, which may be not the case for any observation in the sample

A reparametrization

A more reasonable parametrization is:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 - \bar {x}_1) (x_2 - \bar {x}_2) + u \tag{5.3}

Now, \beta_1 is the partial effect of x_1 on y if (x_2 - \bar {x}_2) = 0, which means that x_2 has the value of is own mean. The same is true for the partial effect of x_2, which equals \beta_2 if x_1 = \bar {x}_1
This parametrization has several advantages
- Easy interpretation of all parameters
- We have standard errors for partial effects at the mean values available
- If necessary, interaction may be centered at other interesting values

Average partial effects

In models with quadratic, interactions, and other nonlinear functional forms, the partial effect depends on the values of one or more explanatory variables
The Average Partial Effect (APE) is a summary measure to describe the relationship between dependent variable and each explanatory variable
- After computing the partial effect for every unit i by plugging in the estimated parameters and the values for the variable x_{i,2} into Equation 5.2, average these individual partial effects across the units of the sample. For our previous example we have: \widehat {APE} \, = \, \frac {1}{n}\sum_{i=1}^n (\hat \beta_1 + \hat \beta_3 x_{i,2}) \, = \, \hat \beta_1 + \hat \beta_3 \frac {1}{n}\sum_{i=1}^n x_{i,2} \, = \, \hat \beta_1 + \hat \beta_3 \bar x_{i,2} \tag{5.4}
- Below, an example shows this is the same as \beta_1 in Equation 5.3 ¹
In R, the APE is easily computed with the avg_slopes() function from the marginaleffects package

5.2.1 Interaction terms - Example

In the following example from Wooldridge (2019) we want to estimate the effect of visiting a lecture for microeconomics, measured by the attendance rate atndrte (% of visiting the lecture), on the success for the final test, measured by stndfnl, a scaled test score (mean = 0, sd = 1)

We estimate the following model:

stndfnl = \beta_0 + \beta_1 \, atndrte + \beta_2 \, priGPA + \beta_3 ACT + \beta_4 \, priGPA^2+ \beta_5 ACT^2 \\ + \beta_6 \, atndrte * priGPA + u

We have two control variables: priGPA (prior GPA result) and ACT (another test result). These should control for the general ability of the students
We further want to know, whether good students benefits more or less from attendance the lecture. This is measured by the interaction variable atndrte\, * \,priGPA:
- If this variable has a positive coefficient, then more capable students (with a higher priGPA) exhibit a higher partial effect of attendance rate und thus benefit more from visiting the lecture \frac {\partial stndfnl}{\partial atndrte} = \beta_1 + \beta_6 \, priGPA
- The interpretation of the coefficient of atndrte, \beta_1, taken for is own is the partial effect of atndrte at a value of 0 for priGPA (which nobody in the sample has)
Furthermore, we allow for additional nonlinear (quadratic) effects of the control variables

rm(list = ls()) # delete all
library(AER)
library(wooldridge)
library(marginaleffects)
library(ggplot2)
library(effects)
library(modelsummary)
data("attend")

# estimation of the model, storing results in list myres
myres <- lm(stndfnl ~ atndrte * priGPA + ACT + I(priGPA^2) + I(ACT^2), data = attend)

# storing coefficients in vector b for convenience
b <- myres$coefficients

In an R formula, with the formulation (atndrte * priGPA), the variables atndrte, priGPA and the interaction variable atndrte : priGPA are automatically generated. The function I() is necessary whenever the function inside I() should be taken literally

The results are:

Code

# you can simply use summary(myres) to show the output
# summary(myres)
# or more pretty:
modelsummary(list("Model 1"=myres),
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
             stars = TRUE,
             gof_omit = "A|L|B|F",
             coef_omit = "Interc",
             align = "ldddddd",
             fmt = 4,
             output = "gt")

Table 5.1: Effect of attendance rate on test score

	Model 1
	Est.	S.E.	t	p	2.5 %	97.5 %
atndrte	-0.0067	0.0102	-0.6561	0.5120	-0.0268	0.0134
priGPA	-1.6285***	0.4810	-3.3857	0.0008	-2.5730	-0.6841
ACT	-0.1280	0.0985	-1.3000	0.1940	-0.3214	0.0653
I(priGPA^2)	0.2959**	0.1010	2.9283	0.0035	0.0975	0.4943
I(ACT^2)	0.0045*	0.0022	2.0829	0.0376	0.0003	0.0088
atndrte × priGPA	0.0056	0.0043	1.2938	0.1962	-0.0029	0.0141
Num.Obs.	680
R2	0.229
RMSE	0.87
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

\beta_1 = -0.0067 is the partial effect for atndrte at priGPA = 0, which is quite irrelevant and also has the wrong sign (attendees of a lecture should be beneficial)

APE

Now, we want to calculate the APE of atndrte at the mean of priGPA, which is 2.587 (Equation 5.4). Remember, we stored the regression coefficients in the vector b

# estimate partial effect at the mean of priGPA = 2.586775
b["atndrte"] + mean(attend$priGPA) * b["atndrte:priGPA"]

         atndrte 
     0.007736558

This partial effect is positive (as expected) and also quite high. It means that if attendance rate change from 0% to 100% for an average student, the test score improves by 77% of the standard deviation of the test score (remember, stndfnl is scaled to mean zero and standard deviation = 1)
Is this effect statistically significant? The following test says yes:

# test partial effect 
lht(myres, "atndrte + 2.586775 * atndrte:priGPA = 0" )

     
     Linear hypothesis test:
     atndrte  + 2.586775 atndrte:priGPA = 0
     
     Model 1: restricted model
     Model 2: stndfnl ~ atndrte * priGPA + ACT + I(priGPA^2) + I(ACT^2)
     
       Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
     1    674 519.34                                
     2    673 512.76  1    6.5786 8.6344 0.003412 **
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As always with a F-test with only one restriction, the t-statistic is \sqrt F = \sqrt {8.6344} = 2.9384

Using the procedure avg_slopes() from marginaleffects package, the APE (or AME) should be the same as the calculated value above, and this is the case.

# checking results with "avg_slopes()":
m <- avg_slopes(myres)

# You can simply use m to show the output
m


    Term Estimate Std. Error    z Pr(>|z|)    S   2.5 % 97.5 %
 ACT      0.07606    0.01120 6.79   <0.001 36.4 0.05411 0.0980
 atndrte  0.00774    0.00263 2.94   0.0033  8.2 0.00258 0.0129
 priGPA   0.35876    0.07779 4.61   <0.001 17.9 0.20630 0.5112

Type: response
Comparison: dY/dX

# But more pretty is the following with `modelsummary`

Code

modelsummary(list("APE for model 1"=m), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "",
             align = "ldddddd",
             fmt = 4,
             output = "gt")

Table 5.2: Average partial effects of the variables on test score, calculated with avg_slopes()

	APE for model 1
	Est.	S.E.	t	p	2.5 %	97.5 %
ACT	0.0761***	0.0112	6.7913	<1e-04	0.0541	0.0980
atndrte	0.0077**	0.0026	2.9383	0.0033	0.0026	0.0129
priGPA	0.3588***	0.0778	4.6120	<1e-04	0.2063	0.5112
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

avg_slopes() calculated the APEs (or AMEs) for all variables and additionally carried out significance test and calculated confidence intervals.

The APE for atndrte is the same as the one calculated “by hand” above.

The prob-values are slightly different, because avg_slopes() uses the normal distribution instead of the t-distribution (asymptotic test, therefore the label z-value is used instead of t-value)

With the plot_predictions() function of the marginaleffects package, the effects of attendance rate, subject to different values of pirGPA, can be displayed: more capable students seem to benefit more.

plot_predictions(myres, condition = list("atndrte", "priGPA"=seq(1,3.5,0.5))) + 
  facet_wrap(~ priGPA) + 
  theme_bw()

Figure 5.1: Effect of `atndrte` on `stndfnl`, by `priGPA`

However, the effect is not significant - look for Est. and t-value of the interaction term in the regression Table 5.1

Reparametrization

We estimate the same model once more, but this time with a reparametrization: The interaction term now is atndrte * (priGPA - \overline{priGPA})

With this parametrization, the partial effect of atndrte, \beta_1, is measured at the mean of priGPA, which therefore corresponds to the APE, see Equation 5.3
- In R, de-meaning a variable is done with the function scale(variable, scale=FALSE). Without the option scale=FALSE, the scale function does not only subtract the mean but also divides by the standard deviation

As an aside: If all variables, including y, are scaled with scale(variable), the corresponding coefficients are (\sigma_{x_j} / \sigma_y) \beta_j, which are the so called standardized coefficients. They measure the effects of a one standard deviation variation in x_j (a typical change in x_j) in terms of standard deviations in y. Therefore, they show the relative importance of variations of the x-es for variations in y

# estimation of the model, this time with de-meaned priGPA in the interaction term so, 
# the interpretation of the coef of atndrte is the partial effect of atndrte 
# at the mean of priGPA

myres1 <- lm(stndfnl ~ atndrte + priGPA + ACT + I(priGPA^2) + I(ACT^2) + 
               atndrte:scale(priGPA, scale = FALSE), data = attend)

As the following regression output shows, the estimated APE for atndrte and the t-value exactly correspond to the values we got with the avg_slopes() function in Table 5.2.

Code

# summary(myres1)

modelsummary(list("Model 2"=myres1), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             coef_omit = "Interc",
             align = "ldddddd",
             fmt = 4,
             output = "gt")

Table 5.3: Effect of attendance rate on test score, reparametrized model 2

	Model 2
	Est.	S.E.	t	p	2.5 %	97.5 %
atndrte	0.0077**	0.0026	2.9384	0.0034	0.0026	0.0129
priGPA	-1.6285***	0.4810	-3.3857	0.0008	-2.5730	-0.6841
ACT	-0.1280	0.0985	-1.3000	0.1940	-0.3214	0.0653
I(priGPA^2)	0.2959**	0.1010	2.9283	0.0035	0.0975	0.4943
I(ACT^2)	0.0045*	0.0022	2.0829	0.0376	0.0003	0.0088
atndrte × scale(priGPA, scale = FALSE)	0.0056	0.0043	1.2938	0.1962	-0.0029	0.0141
Num.Obs.	680
R2	0.229
RMSE	0.87
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Code

modelsummary( list(myres, myres1), 
              output="gt", 
              statistic = "statistic",
              coef_omit = "Inter",
              gof_omit = "A|B|L|F", 
              align = "ldd", 
              stars = TRUE, 
              fmt = 4)

Table 5.4: Comparing model 1 with the reparametrized model 2

	(1)	(2)
atndrte	-0.0067	0.0077**
	(-0.6561)	(2.9384)
priGPA	-1.6285***	-1.6285***
	(-3.3857)	(-3.3857)
ACT	-0.1280	-0.1280
	(-1.3000)	(-1.3000)
I(priGPA^2)	0.2959**	0.2959**
	(2.9283)	(2.9283)
I(ACT^2)	0.0045*	0.0045*
	(2.0829)	(2.0829)
atndrte × priGPA	0.0056
	(1.2938)
atndrte × scale(priGPA, scale = FALSE)		0.0056
		(1.2938)
Num.Obs.	680	680
R2	0.229	0.229
RMSE	0.87	0.87
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Hence, as argued above, this reparametrization directly delivers the APE

We check with avg_slopes() from the marginaleffects package and once more get the same results

# checking with "avg_slopes": calculate APE
m1 <- avg_slopes(myres1)
m1

     
         Term Estimate Std. Error    z Pr(>|z|)    S   2.5 % 97.5 %
      ACT      0.07606    0.01120 6.79   <0.001 36.4 0.05411 0.0980
      atndrte  0.00774    0.00263 2.94   0.0033  8.2 0.00258 0.0129
      priGPA   0.35876    0.07779 4.61   <0.001 17.9 0.20630 0.5112
     
     Type: response
     Comparison: dY/dX

Diagrams of non-linear effects

Finally, we want to plot the non-linear effects of priGPA and ACT at the means of all other variables.

Both have quadratic terms in the model and priGPA additionally an iteration with atndrte. So, calculating the effects by hand would be quite tedious. Fortunately, the function plot_predictons() of the marginaleffects package does the work for us
- With plot_slopes(), you get the marginal effect (slope) of the variable, like in Equation 5.2
- With plot_predictons(), you get the effect of the variable on the expected value of the dependent variable

plot_predictions(myres1, condition="priGPA") + theme_bw()
plot_slopes(myres1, variables="priGPA", condition=list("priGPA", "atndrte"=c(0, 50, 100)) ) + theme_bw()

plot_predictions(myres1, condition="ACT") + theme_bw()
plot_slopes( myres1, variables="ACT", condition="ACT" ) + theme_bw()

5.3 Dummy variables

Dummy variables contain qualitative information which are typically coded as 1 or 0.

Examples for qualitative information are
- Cross section data: gender, race, family background, industry, region, rating grade, poeple who got a special treatment or not, …
- Times series data: seasonality, special events like strikes or natural catastrophes, time before and after a certain date / policy intervention, …
Dummy variables may appear as the dependent or as independent variables
- Example for a dependent dummy variable: InLaborForce = \beta_0 + \beta_1 FamilyIncome + \beta_2 NumOfChildrens + \ \cdots \ + u Here, InLaborForce is a dummy variable with value equal 1 if the person (woman) is in the labor force (working), and a value equal 0 if the person is not. Such cases are not treated in this section, but in a later one about the linear probability model
- Example for an independent dummy variable (which are treated in this section): wage = \beta_0 + \beta_1 female + \ \cdots \ + u Here, female is a dummy variable with value equal 1 if the person is a woman, and a value equal 0 if the person is a man
  - The parameter \beta_1 represents the gain/loss in wages, if the person is a woman rather than a men (holding other things, which are controlled for, fixed)

Example data

library(AER); library(wooldridge)
data("wage1")

library(gt)
gt(head(wage1, 10))

wage	educ	exper	tenure	female	married	numdep	smsa	west	trade	services	profserv	profocc	clerocc	servocc	lwage	expersq	tenursq
3.10	11	2	0	1	0	2	1	1	0	0	0	0	0	0	1.131402	4	0
3.24	12	22	2	1	1	3	1	1	0	1	0	0	0	1	1.175573	484	4
3.00	11	2	0	0	0	2	0	1	1	0	0	0	0	0	1.098612	4	0
6.00	8	44	28	0	1	0	1	1	0	0	0	0	1	0	1.791759	1936	784
5.30	12	7	2	0	1	1	0	1	0	0	0	0	0	0	1.667707	49	4
8.75	16	9	8	0	1	0	1	1	0	0	1	1	0	0	2.169054	81	64
11.25	18	15	7	0	0	0	1	1	1	0	0	1	0	0	2.420368	225	49
5.00	12	5	3	1	0	0	1	1	0	0	0	1	0	0	1.609438	25	9
3.60	12	26	4	1	0	2	1	1	1	0	0	1	0	0	1.280934	676	16
18.18	17	22	21	0	1	0	1	1	0	0	0	1	0	0	2.900322	484	441

Example - Wage equation

We estimate model explaining log(wage) with education and gender

out <- lm(lwage ~ female, data = wage1)

Code

modelsummary(list("Wage dependent on gender "=out), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "AI|L|F|B",
             align = "ldddddd",
             output = "gt")

Table 5.5: Wage equation

	Wage dependent on gender
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	1.814***	0.030	60.830	<0.001	1.755	1.872
female	-0.397***	0.043	-9.222	<0.001	-0.482	-0.313
Num.Obs.	526
R2	0.140
R2 Adj.	0.138
RMSE	0.49
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Interpretation of results of this example

The intercept, 1.814, is the average log(wage) of men (as there are no other variables in the model)
The coefficient of female shows the difference in the averages of log(wages) of men and woman
- A value of -0.397 means that women have a lower log(wage) of 0.397 than men. Hence, a woman earns on average about 40% less
- As there are no other control variables in the model, this results are with regard to average persons, which differ in their gender. The difference in the means is statistically highly significant
- We can get a similar information with a Boxplot

Boxplot

boxplot( wage1$lwage ~ wage1$female )

Dummy variable trap

When using dummy variables, one category always has to be omitted
For instance, if we have a dummy for female, we must not include a dummy for man when there is an intercept in the equation
The reason is that the two dummy variables would sum to one for every observation and therefore would be perfect collinear with the intercept which makes an estimation impossible

Because one category is left out, the coefficient of the dummy variable measures the effect of the dummy category relative to the one left out, which serve as a basis

The previous regression show a statistical significant difference in earned wages between man and woman

Does this mean that woman are discriminated?
Not really, because the difference could be due to objective and relevant factors like education, experience, tenure, occupation and so on. Omission of these factors probably leads to an omitted variable bias. As we have data on some of these factors, we can control for them

Hence, we estimate a more elaborate version the following model

log(wage) \ = \ \beta_0 + \beta_1 female + \beta_2 educ + \beta_3 exper + \beta_4 exper^2 + \\ \beta_5 tenure + \beta_6 tenure^2 + u

out1 <- lm(lwage ~ female + educ + exper + I(exper^2) + 
             tenure + I(tenure^2), data = wage1)

Code

modelsummary(list("Wage dependent on gender and many control variables"=out1), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "AI|L|F|B",
             align = "ldddddd",
             output = "gt")

Table 5.6: Extended wage equation

	Wage dependent on gender and many control variables
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	0.417***	0.099	4.212	<0.001	0.222	0.611
female	-0.297***	0.036	-8.281	<0.001	-0.367	-0.226
educ	0.080***	0.007	11.868	<0.001	0.067	0.093
exper	0.029***	0.005	5.916	<0.001	0.020	0.039
I(exper^2)	-0.001***	0.000	-5.431	<0.001	-0.001	-0.000
tenure	0.032***	0.007	4.633	<0.001	0.018	0.045
I(tenure^2)	-0.001*	0.000	-2.493	0.013	-0.001	-0.000
Num.Obs.	526
R2	0.441
R2 Adj.	0.434
RMSE	0.40
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Interpretation of results

The intercept is the log(wage) of men (basis category) with zero education, zero, experience and zero tenure and is about 0.417. Note, how large the difference is to the first version without controls (Table 5.5), where the results are with regard of an average man with some education, some experience and tenure
The coefficient of female shows the difference in the log(wages) of men and woman
- A value of -0.297 means that a woman with the same level of education, experience and tenure as a man earns on average about 30% less.
  - But note, that the formulation of the model presumes that the effect of gender on wages is equal for all individuals. For a more general formulation, slope effects via interaction variables have to be take into account; see next section
One year more of education leads to a higher wage of 8%, independent of gender
The effects of experience and tenure are non-linear modeled, so the partial effects depend on the corresponding levels of the variables and are therefore hard to interpret
- But as the coefficients of the linear terms are positive and the coefficients of the quadratic ones are negative, we can guess that the effects decline with higher experience and tenure. As we already know, we can use the function avg_slope() of the marginaleffectspackage to calculate the APEs, and the functionplot_predictions()` to plot the non-linear relationships
Note, that the Adjusted R-squared is now 0.43 compared to 0.14 in the model without controls. Hence, education, experience and tenure are indeed very important factors of wage earnings

5.4 Interactions with dummy variables

So far, the specification of the model contains one dummy variable for gender. But this can only explain a ceteris paribus difference in the level of wage earnings for woman but do not allow that the payoffs (slopes) for education, experience and tenure depend on gender. However, these slopes could change with gender as well, leading to the fact that gender effects are not equal for all individuals
If one wants to test for the general effects of gender on wage earnings these slope effects have to be taken into account. This can be done with interaction terms for all variables with the female dummy
In R, this can be very easily accomplished with the following notation using ” * ” lwage \ \sim \ female * ( \ educ + exper + I(exper^2) + tenure + I(tenure^2) \ ) With this formulation, all variables within the brackets are included, plus all interaction terms for every variable (including the intercept) with female
After estimation, one can test whether the coefficients of all terms involving female (6 in our case) are jointly zero with an usual F-test. If H₀ is rejected, gender has a significant effect an wage earnings
The difficulty with this general formulation is that it is not so easy to say how large the wage differential actual is. The reason is, there may be a negative effect in the intercept but different effects on the slopes (payoffs) of education, experience or tenure. Therefore, you have to relay once again on APEs, i.e., you have to calculate an average effect over all individuals with avg_slopes()

out2 <- lm(lwage ~ female * (educ + exper + I(exper^2) + tenure + I(tenure^2)), data = wage1)
# summary(out2)

Code

modelsummary(list("Wage dependent on gender and many control variables"=out2), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "AI|L|F|B",
             align = "ldddddd",
             output = "gt")

Table 5.7: Extended wage equation with slope effects

	Wage dependent on gender and many control variables
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	0.215+	0.129	1.659	0.098	-0.040	0.469
female	0.108	0.193	0.561	0.575	-0.271	0.487
educ	0.087***	0.009	9.881	<0.001	0.070	0.104
exper	0.040***	0.007	5.708	<0.001	0.027	0.054
I(exper^2)	-0.001***	0.000	-4.955	<0.001	-0.001	-0.000
tenure	0.033***	0.009	3.696	<0.001	0.015	0.050
I(tenure^2)	-0.001*	0.000	-1.981	0.048	-0.001	-0.000
female × educ	-0.014	0.014	-1.034	0.302	-0.041	0.013
female × exper	-0.023*	0.010	-2.340	0.020	-0.042	-0.004
female × I(exper^2)	0.000+	0.000	1.794	0.073	-0.000	0.001
female × tenure	0.007	0.015	0.451	0.652	-0.022	0.036
female × I(tenure^2)	-0.001	0.001	-1.443	0.150	-0.002	0.000
Num.Obs.	526
R2	0.461
R2 Adj.	0.450
RMSE	0.39
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Interpretation of the coefficients

The coefficients (slopes) without female are the ones relevant for males, whereas the coefficients with female show the difference of the slopes for woman compared to man. Therefore, to get the effects for woman, you have to add the relevant coefficients. For instance, the marginal effect (first partial derivative) of experience on woman is:

\beta_4 + 2 \beta_5 exper + \beta_9 female + 2\beta_{10} female \cdot exper

Clearly, this marginal effect depends on the values of experience and gender. Use the R function avg_slopes() to calculate an average effect
- An alternative would be to recenter (demean) the interaction variables before estimation, as discussed in the section “Interaction terms” above
To test whether gender has an overall effect on wage earnings (intercept and/or slopes) we carry out an F-test. Under the H₀, all interaction effect are zero, i.e., the same regression coefficients apply to men and women
With the help of the R function matchCoef() it is easy to choose all variables which contain female in their name

Test, whether all variables which contain `female` are jointly insignificant

As expected, the H₀ is clearly rejected:

lht( out2, matchCoefs(out2, "female") )

     
     Linear hypothesis test:
     female = 0
     female:educ = 0
     female:exper = 0
     female:I(exper^2) = 0
     female:tenure = 0
     female:I(tenure^2) = 0
     
     Model 1: restricted model
     Model 2: lwage ~ female * (educ + exper + I(exper^2) + tenure + I(tenure^2))
     
       Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
     1    520 93.911                                  
     2    514 79.920  6    13.991 14.997 7.654e-16 ***
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The APEs for the variables

Code

m2 <- avg_slopes(out2)

modelsummary(list("APEs"=m2), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "",
             align = "ldddddd",
             output = "gt")

Table 5.8: APEs for the extended wage equation with slope effects

	APEs
	Est.	S.E.	t	p	2.5 %	97.5 %
educ	0.080***	0.007	11.712	<0.001	0.067	0.093
exper	0.010***	0.002	5.164	<0.001	0.006	0.013
female	-0.314***	0.036	-8.790	<0.001	-0.384	-0.244
tenure	0.027***	0.005	5.286	<0.001	0.017	0.037
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

The APE for female shows the overall wage differential for two average persons which only differ in gender. The effect is about -31%
If you explicitly want the average differences in slope effects for an average man and an average woman you have to average over men and woman separately, which is shown next

Code

# in averaging, we only consider man 
m3 <- avg_slopes(out2, newdata = subset(female==0))

modelsummary(list("APEs"=m3), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "",
             align = "ldddddd",
             output = "gt")

Table 5.9: APEs for the extended wage equation with slope effects, averaged only for man

	APEs
	Est.	S.E.	t	p	2.5 %	97.5 %
educ	0.087***	0.009	9.883	<0.001	0.070	0.104
exper	0.014***	0.003	5.222	<0.001	0.009	0.019
female	-0.342***	0.038	-9.036	<0.001	-0.417	-0.268
tenure	0.025***	0.006	4.491	<0.001	0.014	0.036
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

In this case, female tells us, how much an average man would loose if he were a woman. Look also at the slope effects of educ, tenure and experience, which are now the marginal effects for an average man and not an average person.

We make the same for an average woman, by averaging only over woman:

Code

# in averaging, we only consider woman 
m4 <- avg_slopes(out2, newdata = subset(female==1))

modelsummary(list("APEs"=m4), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "",
             align = "ldddddd",
             output = "gt"
            )

Table 5.10: APEs for the extended wage equation with slope effects, averaged only for woman

	APEs
	Est.	S.E.	t	p	2.5 %	97.5 %
educ	0.073***	0.011	6.860	<0.001	0.052	0.093
exper	0.005+	0.003	1.942	0.052	-0.000	0.010
female	-0.283***	0.036	-7.902	<0.001	-0.353	-0.213
tenure	0.029***	0.009	3.313	<0.001	0.012	0.046
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Now, female tells us, how much an average woman would gain if she were a man. Why is this different?

Here is a graphical representation of all these estimates of the APEs’, subject to gender:

plot_slopes(out2, variables=c("educ", "exper", "tenure", "female"), by="female" ) + theme_bw()

5.5 Treatment effects and the DiD estimator

Another important application of interaction terms is the evaluation of treatment effects of so called quasi or natural experiments. ²

As an example suppose we want to estimate the effect of a medical treatment on the outcome of a disease
To do this we need two groups of patients, treated and untreated, randomly assigned, which we observe at least for two periods – the period before and after treatment
- To measure the progress of the disease for the treated patients alone is not sufficient because other effects could be in play which effects treated and untreated (e.g., placebo effect in this example, or general trends)
  - To control for these, one strategy is to compare the outcome of the disease for the treated patients with the outcome of the untreated patients over time.
    By focusing on the differece of changes in y over the course of the experiments, we can control for some pre-treatment differences (e.g., not a really random assignment, which affects only the level of the outcome and not the trend) and for factors which affect treated and untreated equally
- So we examine the difference in the clinical status of the treated patients form period one to period two and compare this with the difference in the clinical status of the untreated patients form period one to period two. The difference of these two differences leads to the treatment effect looked for. Therefore the name difference in difference (DiD) estimator

Example – house prices

Let us examine another example with actual data.

In 1981, a garbage incinerator (Müllverbrennungsanlage) was built in North Andover, Massachusetts
We have a pooled cross-section data set of prices for sold houses for the years 1978 and 1981 and want to examine whether this building negatively affected the house prices for houses nearby (within 3 miles) the incinerator (this is the “treatment” group)

The following R code carries out an DiD estimator for this example:

library(wooldridge)

# Data set "hprice3" of the Wooldridge package
data(hprice3)


# Defining the dummy variable "nearinc" from the variable "dist", 
# which is the distance from the incinerator in feet; 15840 = 3 miles
nearinc <- as.numeric(0 + hprice3$dist<15840)


# Control group:
# Difference in the average log house prices "lprices" for houses 
# not near the incinerator, from 78  to 81 
D1 <- mean( hprice3$lprice[ hprice3$y81 == 1 & nearinc == 0 ] ) - 
      mean( hprice3$lprice[ hprice3$y81 == 0 & nearinc == 0 ] )


# Treatment group:
# Difference in the average log house prices "lprices" for houses 
# near the incinerator, from 78  to 81 ()
D2 <- mean( hprice3$lprice[ hprice3$y81 == 1 & nearinc == 1 ] ) - 
      mean( hprice3$lprice[ hprice3$y81 == 0 & nearinc == 1 ] )


# Difference of the two changes -- classical DiD estimator
DiD <- D2-D1

Code

```{r}
#| comment: "   "
#| code-fold: true

# Building strings for printing
paste( "D1 (difference of houseprices from 78 to 81, not near incin.) = ", round(D1, 3) )
paste( "D2 (difference of houseprices from 78 to 81, near incin.) = ", round(D2, 3) )
paste( "DiD (difference of difference, D2 - D1)  = ", round(DiD, 3) )
```

    [1] "D1 (difference of houseprices from 78 to 81, not near incin.) =  0.457"
    [1] "D2 (difference of houseprices from 78 to 81, near incin.) =  0.394"
    [1] "DiD (difference of difference, D2 - D1)  =  -0.063"

Result of the example above:

The incinerator leads to an estimated decrease of house prices of 6.3% for houses nearby the incinerator. (If other factors, which affect the control and treatment group differently plays no role – parallel trend assumption, which is essential for DiD estimations)

How does this work?

We are interested in the treatment effect on the variable Y which is defined as (Y_T^T[1] - Y_T^{NT}[1])
- Thereby Y_T^T[1] is the value of Y of the treated group (subscript T) at period 1 after treatment (superscript T)
- Y_T^{NT}[1] equals to the value of treated group at period 1, but hypothetically not treated
  - Hence, we compare the Y-value of the treated group after treatment (which we can observe) with the Y-value of the treated group under the hypothesis that this group had not been treated. But this is counterfactual and thus not observed
  - Therefore, we need an estimate of the hypothetical Y_T^{NT}[1]. And here comes the basic assumption of all DiD estimations – the parallel trend assumption:
    - We assume that the change over time of the treated group would have been the same as the change of the untreated group, if the treated group had not been treated
  - Thus an estimate of Y_T^{NT}[1] is: \; \widehat {Y_T^{NT}[1]} \ = \ Y_T[0] + (Y_{NT}[1] - Y_{NT}[0]).
    This this the Y-value of the treated group in period 0 (before all treatments) plus the change over time of the untreated group. We can plug this result into our definition of the treatment effect above and reach to:
  - ( Y_T^T[1] - \widehat {Y_T^{NT}[1]} ) \ = \ \underbrace {(Y_T^T[1] - Y_T[0])}_{D2} - \underbrace {(Y_{NT}[1] - Y_{NT}[0])}_{D1}
- But this is the DiD estimator used in our example!

DiD estimator as interaction term

The same analysis, which was somewhat tedious, can be done in a more simple and flexible way by a regression with an interaction term ³

lprice = \beta_0 + \beta_1 \, nearinc + \beta_2 \, y81 + \beta_3 \, (y81 \times nearinc) + u

With lprice equals log(house prices)
nearinc is a dummy which is equal to one if the house is located near the incinerator (within 3 miles) and zero otherwise (this is the treatment dummy)
y81 is a dummy variable which equals zero for year 1978 and with a value equal one for year 1981 (this is a time dummy)
y81 * nearinc is an interaction between y81 and nearinc

Thereby:

\beta_0 measures the price for houses not near the incinerator in the year 78; y81 and nearinc are both 0
\beta_1 measures the general price difference for houses near the incinerator to those not near the incinerator, for the year 78; y81 = 0
\beta_2 measures the price difference from 78 to 81, for house not near the incinerator; nearinc = 0
\beta_3 equals the classical DiD estimator: It measures the price difference over time for houses near minus the price difference over time for houses not near the incinerator
- Why is this the case:
- We have for the price increase not near the incinerator (control group):
  (lprice_{81} \mid nearinc = 0) \ - \ (lprice_{78} \mid nearinc = 0) \ = \\ (\beta_0 + \beta_2) \ - \ \beta_0 \ = \ \beta_2
- We have for the price increase near the incinerator (treatment group):
  (lprice_{81} \mid nearinc = 1) \ - \ (lprice_{78} \mid nearinc = 1) \ = \\ (\beta_0 + \beta_1 + \beta_2 + \beta_3) \ - \ (\beta_0 + \beta_1) \ = \ \beta_2 + \beta_3
- The DiD is the difference of the groups’ price changes over time:
  [(lprice_{81} \mid nearinc = 1) \ - \ (lprice_{78} \mid nearinc = 1)] \ -
  [(lprice_{81} \mid nearinc = 0) \ - \ (lprice_{78} \mid nearinc = 0)] \ =
  (\beta_2 + \beta_3) \ - \ \beta_2 \ = \ \beta_3

Hence, the coefficient of the interaction term, \beta_3, is equal to the classical DiD estimator

On second thoughts, this result should not be too surprising. We know from Section 5.2 that interaction terms generally gives the difference in slopes compared with the reference category ⁴. In this case, we have the effect of the time dummy y81, which shows the increase over time for the control group, and the interaction with nearinc shows the additional time effect (increase or decrease) caused by nearinc. And that is the treatment effect we are looking for.

Let’s estimate the model with an interaction term as outlined above:

out <- lm( lprice ~ nearinc + y81 + nearinc:y81, data = hprice3 )
# summary(out)

Code

modelsummary(list("DiD"=out), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")

Table 5.11: DiD regression with interaction term

	DiD
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	11.285***	0.031	369.839	<0.001	11.225	11.345
nearinc	-0.340***	0.055	-6.231	<0.001	-0.447	-0.233
y81	0.457***	0.045	10.084	<0.001	0.368	0.546
nearinc × y81	-0.063	0.083	-0.751	0.453	-0.227	0.102
Num.Obs.	321
R2	0.409
RMSE	0.34
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

The estimated coefficient of the interaction variable is -0.063, which is identical to the classical difference in difference estimator in the example above

However, we now have estimated standard errors and t-values enabling statistical tests
- Actually, the found effect is not significant!
- Notice, the negative sign of nearinc which means that prices of these houses have been 34% lower long before the incinerator was build or even announced. Why?
We further check: \beta_2 = 0.457 is the price change of the control group (not near the incinerator). This is identically to D1 in the example above
We check: \beta_2 + \beta_3 = 0.394 is the price change of the treatment group (near the incinerator). This is identically to D2 in the example above
And D2-D1 gives -0.063, the DiD estimator of the example above, which is \beta_3

We estimate the same model as above, but this time with additional control variables: age of house, larea = log of house area, lland = log of area of the property and linstsq = log of distance to interstate².

These should control for the price differences before the incinerator was announced
And more importantly, the characteristics of the houses sold in 81 could have changed compared to 78 in a different way for houses near or not near the incinerator

out1 <- lm( log(price) ~ nearinc + y81 + nearinc:y81 + 
              larea + lland +  linstsq + age + agesq, data = hprice3 )
# summary(out1)

Code

modelsummary(list("DiD with controls"=out1), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")

Table 5.12: DiD regression with interaction term and control variables

	DiD with controls
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	6.690***	0.342	19.559	<0.001	6.017	7.363
nearinc	0.012	0.049	0.245	0.806	-0.084	0.108
y81	0.418***	0.029	14.195	<0.001	0.360	0.476
larea	0.509***	0.041	12.482	<0.001	0.429	0.590
lland	0.116***	0.025	4.546	<0.001	0.066	0.166
linstsq	-0.005*	0.002	-2.509	0.013	-0.008	-0.001
age	-0.012***	0.001	-9.292	<0.001	-0.014	-0.009
agesq	0.000***	0.000	7.180	<0.001	0.000	0.000
nearinc × y81	-0.151**	0.053	-2.849	0.005	-0.256	-0.047
Num.Obs.	321
R2	0.774
RMSE	0.21
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

This now leads to a significant effect of the interaction variable nearinc:y81 of about -15%.
Furthermore, nearinc in 78 (before the incinerator was build) is not significant any more as it should

Conclusion: Regressions with interaction terms have several advantages over a classical DiD estimators:

It enables tests, additional control variables, which affects the groups differently over time, and allows for continuous treatment variables instead of dummies
However, all DiD estimates will be biased and inconsistent if different trends (or other factors, like self-selection), which are not controlled for, are at work in the control and treatment groups
- This is especially relevant when the treatment of groups starts at different time periods (staggered timing)
  - and the treatment effects are not homogeneously effective for the groups and/or more importantly,
  - the treatment effects are not constant over time (change the trend or exhibit a dynamic pattern)
- For these advanced aspects see Goodman-Bacon (2021) or his very enlightening video https://www.youtube.com/watch?v=m1xSMNTKoMs
- Some possible solutions, notably in a panel setting, are proposed in Callaway and Sant’Anna (2021), Sun and Abraham (2021), Wooldridge (2021), Lee and Wooldridge (2023) and Gardner (2022)
  - Applying standard DiD estimators in a panel setting to data with staggered treatment timings can lead to a particular issue: comparing units treated at one point in time with other groups that were treated in some previous periods. This basically leads to a violation of the parallel trend assumption if treatment causes not only a level effect, but also a trend effect. The proposed solutions from some papers mentioned above want to avoid this specific type of comparison.
    On the other hand, some papers (especially those from Wooldridge) argue that a misspecification of the estimated model is the core of the problem which should be tackled with a full set of dummy/interaction variables.
    Both approaches generally leads to valid estimates.
- For a simple R implementation of an event study approach (which allows for dynamic effects but assumes homogeneous treatment effects), look at the video: https://www.youtube.com/watch?v=NmKNQdsLlh0
- In R, there are several specialized packages for doing generalized DiD estimates which allow for heterogeneous dynamic treatment effects and staggered timings; for example did implements the approach from Callaway and Sant’Anna (2021), did2s implements Butts and Gardner (2021), fixest with sunab() function does Sun and Abraham (2021) and etwfe realizes Wooldridge (2021).

Simulation study

Simulation study with several recently proposed DiD-estimation procedures with staggered timing under different scenarios

Defining a function which pads a vector with NAs:

Code

# Defining a function which pads a vector with NA
pad = function(x, ll) {
  if (is.vector(x)) {
    l = length(x)
    dl = ll-l 
    if (dl > 0) {
      dx = rep(NA, dl)
      x = c(x, dx)  
    }
    if (dl < 0) {
      x = x[1:ll]
    }
    return(x)
  }
  return("Error, x must be a vector!")
}

Code from did2s package, Butts (2021)

did2s estimation code, (changed output format and integrated the Wooldridge (2021) estimator into the code)
did2s plotting code, (added the aggregated true treatment effects to the plots, some minor output changes)

Code

###############################################################################
###############################################################################
###############################################################################

#' Estimate event-study coefficients using TWFE and 5 proposed improvements.
#'
#' @description Uses the estimation procedures recommended from Borusyak, Jaravel, Spiess (2021); Callaway and Sant'Anna (2020); Gardner (2021); Roth and Sant'Anna (2021); Sun and Abraham (2020)
#'
#' @param data The dataframe containing all the variables
#' @param yname Variable name for outcome variable
#' @param idname Variable name for unique unit id
#' @param tname Variable name for calendar period
#' @param gname Variable name for unit-specific date of initial treatment
#'   (never-treated should be zero or NA)
#' @param xformla A formula for the covariates to include in the model.
#'   It should be of the form `~ X1 + X2`. Default is NULL.
#' @param weights Variable name for estimation weights. This is used in
#'   estimating Y(0) and also augments treatment effect weights
#' @param estimator Estimator you would like to use. Use "all" to estimate all.
#'   Otherwise see table to know advantages and requirements for each of these.
#'
#' @return `event_study` returns a data.frame of point estimates for each estimator
#'
#' @examples
#' \donttest{
#' out = event_study(
#'   data = did2s::df_het, yname = "dep_var", idname = "unit",
#'   tname = "year", gname = "g", estimator = "all"
#' )
#' plot_event_study(out)
#' }
#' @export
event_study2 = function(data, yname, idname, gname, tname,
                       xformla = NULL, weights = NULL,
                       estimator = c("all", "TWFE", "did2s", "did", "impute", "sunab",
                                  "staggered", "Wooldridge"), xvar = NULL 
               ){

# Check Parameters -------------------------------------------------------------

    # Select estimator
    estimator <- match.arg(estimator)

    # Display message about estimator's different assumptions
    if(estimator == "all") {
        message("Note these estimators rely on different underlying assumptions. See Table 2 of `https://arxiv.org/abs/2109.05913` for an overview.")
    }

    # Test that gname is in tname or 0/NA for untreated
    if(!all(
            unique(data[[gname]]) %in% c(0, NA, unique(data[[tname]]))
        )) {
        stop(sprintf(
            "'%s' must denote which year treatment starts for each group. Untreated observations should have g = 0 or NA.",
            gname
        ))
    }

    # Test that there exists never-treated units
    if(!any(
        unique(data[[gname]]) %in% c(0, NA)
        )) {
        stop(
            "event_study only works when there is a never-treated groups. This will be updated in the future, though with fewer estimators."
        )
    }

    # If `xformla` is included, note
    if(!is.null(xformla)) {
        if(estimator %in% c("all", "staggered")) {
            message(paste0("Warning: `", xformla, "` is ignored for the `staggered` estimator"))
        }
    }

# Setup ------------------------------------------------------------------------

    # Treat
    data$zz000treat = 1 * (data[[tname]] >= data[[gname]]) * (data[[gname]] > 0)
    data[is.na(data$zz000treat), "zz000treat"] = 0

    # Set g to zero if NA
    data[is.na(data[[gname]]), gname] = 0

    # Create event time
    data$zz000event_time = ifelse(
        is.na(data[[gname]]) | data[[gname]] == 0 | data[[gname]] == Inf,
        -Inf,
        as.numeric(data[[tname]] - data[[gname]])
    )

    event_time = unique(data$zz000event_time)
    event_time = event_time[!is.na(event_time) & is.finite(event_time)]

    # Format xformla for inclusion
    if(!is.null(xformla)) {
        xformla_null = paste0("0 + ", as.character(xformla)[[2]])
    } else {
        xformla_null = "0"
    }



# initialize empty arguments
tidy_twfe = NULL
tidy_did2s = NULL
tidy_did = NULL
tidy_sunab = NULL
tidy_impute = NULL
tidy_staggered = NULL
tidy_etwfe = NULL

# TWFE -------------------------------------------------------------------------
if(estimator %in% c("TWFE", "all")) {
    message("Estimating TWFE Model")

    try({
        twfe_formula = stats::as.formula(
      paste0(
        yname, " ~ 1 + ", xformla_null, " + i(zz000event_time, ref = c(-1, -Inf)) | ", idname, " + ", tname
      )
    )

        est_twfe = fixest::feols(twfe_formula, data = data, warn = F, notes = F)

        # Extract coefficients and standard errors
        tidy_twfe = broom::tidy(est_twfe)

        # Extract zz000event_time
        tidy_twfe = tidy_twfe[grep("zz000event_time::", tidy_twfe$term), ]

        # Make event time into a numeric
        tidy_twfe$term = as.numeric(gsub("zz000event_time::", "", tidy_twfe$term))

        # Subset column
        tidy_twfe = tidy_twfe[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_twfe)) warning("TWFE Failed")
}

# did2s ------------------------------------------------------------------------
if(estimator %in% c("did2s", "all")) {
    message("Estimating using Gardner (2021)")
  
    try({
        did2s_first_stage = stats::as.formula(
      paste0(
        "~ 0 + ", xformla_null, " | ", idname, " + ", tname
      )
    )

        est_did2s = did2s::did2s(data, yname = yname, first_stage = did2s_first_stage, second_stage = ~i(zz000event_time, ref=-Inf), treatment = "zz000treat", cluster_var = idname, verbose = FALSE)

        
        # Extract coefficients and standard errors
        tidy_did2s = broom::tidy(est_did2s)

        # Extract zz000event_time
        tidy_did2s = tidy_did2s[grep("zz000event_time::", tidy_did2s$term), ]

        # Make event time into a numeric
        tidy_did2s$term = as.numeric(gsub("zz000event_time::", "", tidy_did2s$term))

        # Subset columns
        tidy_did2s = tidy_did2s[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_did2s)) warning("Gardner (2021) Failed")
}

# did --------------------------------------------------------------------------
if(estimator %in% c("did", "all")) {
    message("Estimating using Callaway and Sant'Anna (2020)")

    try({
        est_did = did::att_gt(yname = yname, tname = tname, idname = idname, gname = gname, xformla = xformla, data = data)

        est_did = did::aggte(est_did, type = "dynamic", na.rm = TRUE)

        # Extract es coefficients
        tidy_did = broom::tidy(est_did)

        # Subset columns
        tidy_did$term = tidy_did$event.time
        tidy_did = tidy_did[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_did)) warning("Callaway and Sant'Anna (2020) Failed")
}

# sunab ------------------------------------------------------------------------
if(estimator %in% c("sunab", "all")) {
    message("Estimating using Sun and Abraham (2020)")

    try({
        # Format xformla for inclusion
        if(is.null(xformla)) {
            sunab_xformla = "1"
        } else {
            sunab_xformla = paste0("1 + ", as.character(xformla)[[2]])
        }

        sunab_formla = stats::as.formula(
      paste0(
        yname, " ~ ", sunab_xformla, " + sunab(", gname, ", zz000event_time, ref.c =0, ref.p = -1) | ", idname, " + ", tname
      )
    )

        est_sunab = fixest::feols(sunab_formla, data = data)

        tidy_sunab = broom::tidy(est_sunab)

    # Extract zz000event_time
        tidy_sunab = tidy_sunab[grep("zz000event_time::", tidy_sunab$term), ]

        # Make event time into a numeric
        tidy_sunab$term = as.numeric(gsub("zz000event_time::", "", tidy_sunab$term))

        # Subset columns
        tidy_sunab = tidy_sunab[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_sunab)) warning("Sun and Abraham (2020) Failed")
}

# did_imputation ---------------------------------------------------------------
if(estimator %in% c("impute", "all")) {
    message("Estimating using Borusyak, Jaravel, Spiess (2021)")

    try({
        impute_first_stage = stats::as.formula(
      paste0(
        "~ 1 + ", xformla_null, "+ i(", tname, ") | ", idname
      )
    )

        tidy_impute = didimputation::did_imputation(data,
                                     yname = yname, gname = gname, tname = tname, idname = idname,
                                     first_stage = impute_first_stage, horizon = TRUE, pretrends = TRUE)

        # Subset columns
        tidy_impute = tidy_impute[, c("term", "estimate", "std.error")]

        tidy_impute = tidy_impute[grep("^(-)?[0-9]+$", tidy_impute$term), ]

        # Make event time into a numeric
        tidy_impute$term = as.numeric(tidy_impute$term)
    })

    if(is.null(tidy_impute)) warning("Borusyak, Jaravel, Spiess (2021) Failed")
}

# staggered --------------------------------------------------------------------
if(estimator %in% c("staggered", "all")) {
    # Waiting for staggered on CRAN
    message("Estimating using Roth and Sant'Anna (2021)")

    try({
        # Make untreated g = Inf
        data_staggered = data

        data_staggered[,gname] = ifelse(
            data_staggered[[gname]] == 0,
            Inf,
            data_staggered[[gname]]
        )

        event_time_staggered = event_time[is.finite(event_time) & event_time != -1]
        event_time_staggered = event_time_staggered[event_time_staggered != min(event_time_staggered) ]

        tidy_staggered = staggered::staggered(
            data_staggered,
            i = idname, t = tname, g = gname, y = yname, estimand = "eventstudy",
            eventTime = event_time_staggered
        )

        # Subset columns
        tidy_staggered$term = tidy_staggered$eventTime
        tidy_staggered$std.error = tidy_staggered$se

        tidy_staggered = tidy_staggered[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_staggered)) warning("Roth and Sant'Anna (2021) Failed")
}

# etwfe ------------------------------------------------------------------------
if(estimator %in% c("Wooldridge", "all")) {
    message("Estimating using Wooldridge (2021)")
  
        
    try({
         # as etwfe and emfx are using formulas, you must use the true names for yname and idname
         # from the data set data !!!
         # In this particular case: yy and id
      
      
        try({ 
           if (is.null(xformla)) etwfe_formula = stats::as.formula("yy ~ 1")
             else etwfe_formula = stats::as.formula(paste0("yy ~ ", as.character(xformla)[[2]]))
       })

           
           est_etwfe = etwfe::etwfe(etwfe_formula , tvar = tname,  gvar = gname, vcov = ~id, data = data)

           # est_etwfe = etwfe::etwfe(fml = yy ~ xx , tvar = tname,  gvar = gname, vcov = ~id, data = data)
    
       est_etwfe = etwfe::emfx(est_etwfe, type = "event", post_only = FALSE) 

           # Extract es coefficients
           tidy_etwfe = broom::tidy(est_etwfe)

          # Subset columns
          tidy_etwfe$term = tidy_etwfe$event
          tidy_etwfe = tidy_etwfe[, c("term", "estimate", "std.error")]
    })

    if(is.null(tidy_etwfe)) warning("Wooldridge (20121) Fail")
}



# Bind results together --------------------------------------------------------

    # out = data.table::rbindlist(list(
    #   "TWFE" = tidy_twfe,
    #   "Gardner (2021)" = tidy_did2s,
    #   "Callaway and Sant'Anna (2020)" = tidy_did,
    #   "Sun and Abraham (2020)" = tidy_sunab,
    #   "Roth and Sant'Anna (2021)" = tidy_staggered,
    #   "Borusyak, Jaravel, Spiess (2021)" = tidy_impute,
    #.  "Wooldridge (2021)" = tidy_etwfe
    # ), idcol = "estimator")


out1 = data.table::rbindlist(list(
        "TWFE" = tidy_twfe
    ), idcol = "estimator")

out2 = data.table::rbindlist(list(
        "Gardner (2021)" = tidy_did2s
    ), idcol = "estimator")


out3 = data.table::rbindlist(list(
        "Callaway and Sant'Anna (2020)" = tidy_did
    ), idcol = "estimator")


out4 = data.table::rbindlist(list(
        "Sun and Abraham (2020)" = tidy_sunab
    ), idcol = "estimator")

out5 = data.table::rbindlist(list(
            "Roth and Sant'Anna (2021)" = tidy_staggered
    ), idcol = "estimator")


out6 = data.table::rbindlist(list(
        "Borusyak, Jaravel, Spiess (2021)" = tidy_impute
    ), idcol = "estimator")

out7 = data.table::rbindlist(list(
        "Wooldridge (2021)" = tidy_etwfe
    ), idcol = "estimator")

out=list(out1, out2, out3, out4, out5, out6, out7)

    
return(out)

}


###############################################################################
###############################################################################
###############################################################################


#' Plot results of [event_study()]
#' @param out Output from [event_study()]
#' @param separate Logical. Should the estimators be on separate plots? Default is TRUE.
#' @param horizon Numeric. Vector of length 2. First element is min and
#'   second element is max of event_time to plot
#'
#' @return `plot_event_study` returns a ggplot object that can be fully customized
#'
#' @rdname event_study
#'
#' @importFrom rlang .data
#' @export


plot_event_study2 = function(out, separate = TRUE, horizon = NULL, truevalue = NULL) {

  if ( is.null(truevalue) ) truevalue = rep(0, length(out$term))
    
  # Get list of estimators
    estimators = unique(out$estimator)
    

    # Subset factor levels
    levels = c("TWFE", "Borusyak, Jaravel, Spiess (2021)", "Callaway and Sant'Anna (2020)", "Gardner (2021)", "Roth and Sant'Anna (2021)",  "Sun and Abraham (2020)", "Wooldridge (2021)")
    
    levels = levels[levels %in% estimators]
    
    

    # Make estimator into factor
    out$estimator = factor(out$estimator, levels = levels)


    # Subset color scales
    color_scale = c("TWFE" = "#374E55", "Gardner (2021)" = "#DF8F44", "Callaway and Sant'Anna (2020)" = "#00A1D5", "Sun and Abraham (2020)" = "#B24745", "Roth and Sant'Anna (2021)" = "#79AF97", "Borusyak, Jaravel, Spiess (2021)" = "#6A6599", "Wooldridge (2021)" = "#DA8A86")
    color_scale = color_scale[names(color_scale) %in% estimators]


    # create confidence intervals (90%)
    out$ci_lower = out$estimate - 1.65 * out$std.error
    out$ci_upper = out$estimate + 1.65 * out$std.error


    # position depending on separate
    if(separate) position = "identity" else position = ggplot2::position_dodge(width = 0.5)

    # Subset plot if horizon is specified
    if(!is.null(horizon)) {
        out = out[out$term >= horizon[1] & out$term <= horizon[2], ]
    }

    # max and min of limits
    y_lims = c(min(out$ci_lower, truevalue), max(out$ci_upper, truevalue)) * 1.05
    x_lims = c(min(out$term, truevalue) - 1, max(out$term, truevalue) + 1)

    
    ll = length(out$term)
    
    
    ggplot(data = out,
      mapping = aes(
        x = out$term, y = out$estimate,
        color = out$estimator,
        ymin = out$ci_lower, ymax = out$ci_upper
      )
        ) +
        { if(separate) facet_wrap(~ estimator, scales="free") } +
        geom_point(size=2.3) +
      geom_line(linewidth=0.55) + 
        #geom_errorbar() +
    geom_ribbon(aes(ymin = out$ci_lower, ymax = out$ci_upper), color = "lightgrey", alpha = 0.1) +
        geom_vline(xintercept = -0.5, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        labs(y = "Point Estimate, 95% CI and true Effects", x = "Event Time", color = "Estimator") +
        { if(separate) scale_y_continuous(limits = y_lims) } +
        { if(separate) scale_x_continuous(limits = x_lims) } +
        theme_minimal(base_size = 16) +
        scale_color_manual(values = color_scale) +
        theme(legend.position = "none", axis.title.y=element_blank()) + 
      geom_line( aes(x=0:(length(out$term)-1), y=pad(truevalue, length(out$term))), color = "#FF0000", linewidth=0.55, linetype = "dashed"  ) 

}

Code for:

Function for simulating data with staggered treatments, individual heterogeneity, individual treatment effects in levels plus magnitude and random noise and
Function for calculating the true treatment effect of the simulated data
Function for plotting of the simulated data

Code

###############################################################################
#### Start of simulating data ####
###############################################################################
simdata = function(gn, n, T, xbar, ybar, end_year, first_year_treated, scenario, s_all, beta_x=0) {
  
##### Initialization of vectors which are generated during simulation #####
year <- vector(length = T)
id <- vector(length = T)
g_i <- vector(length = T)

ybase <- vector(length = T)
yy <- vector(length = T)
y_trueeff <- vector(length = T)

ybase_bar <- vector(length = T)
yy_bar <- vector(length = T)
y_trueeff_bar <- vector(length = T)

first.treated <- vector(length = T)
treat <- vector(length = T)
now.treated <- vector(length = T)

event_time <-  vector(length = T)

xx <- vector(length = T)  
##### End initialization of vectors over t for dat_g data frame #####  


##### Begin simulation code #####
fixg=0                               # currently unused
fixtime = (runif(T)-0.5)/4           # time specific fixed effect, max +- 18% 
for (g in (1:gn)) {   # defined by scenarios

  first.treated = first_year_treated[g]   # year of first treatment for unit i at t
  treat = ever_treated[g]                 # if ever treated
  
  rg <- runif(1)*2               # g-specific xx-trend form [0, 2]
  if (g==1) rg <- 1              # for untreated group always 1
  
  for (i in 1:n) {
    
    if (scenario <= length(s_all)) {
      sg = s_all[[scenario]][g,]           # choice of the predefined scenario (if not purely random)
    }
    else  {
      sg = s_all[[sample(1:length(s_all), 1)]][g,]  # individual-specific random scenario
    }
    
    fixi =  (runif(1)-0.5)/4       # individual fixed effect, max +- 13%  
    fixgf = (runif(1)-0.5)/5       # individual fixed effect for effects, max +- 10% 
    fixgt = 1+(runif(1)-0.5)/1.5   # multiplicative individual fixed effect for effect, max +- 33% 
 
    for (t in 1:T) {
         
      # baseline values
      id[t] = i + (g-1)*n                   # unit id
      year[t] = start_year_minus_1 + t      # years 
      g_i[t] = g                            # group index
      xx[t] = rnorm(1, rg*xbar[t], sd_x)    # control variable xx, different, group-specific trend compared to ybase 
      now.treated[t] = 0                    # unit is treated at time t, default 0 
      
      first.treated[t] = first_year_treated[g]   # year of first treatment for unit i at t
      treat[t] = ever_treated[g]                 # if ever treated
      
      # if ever treated: event_time = relative time 
      event_time[t] = ifelse(treat[t]==1, year[t] - first.treated[t], NA)  
      
      ybase[t] = beta_x*xx[t] + rnorm(1, ybar[t], sd_y) + fixi + fixtime[t] # plus individual fixed effect
      ybase_bar[t] = beta_x*xx[t] + rnorm(1, ybar[t], 0.000001) + fixi + fixtime[t]  # same without random noise
      
      s = sg   # resetting s in t-loop 
          
      if ( sg[t] != 0 ) {                                # if treated according scenario 
        now.treated[t] = 1                               # unit i is now treated
        s = s * fixgt + fixgf          # individual heterogeneous treatment effects
        yy[t] = ybase[t] + s[t] + rnorm(1, 0, sd_eff)   # treatment + random noise  
      }
      else {
        yy[t] = ybase[t]   # not treated: yy = base 
      }
          
      y_trueeff[t] = yy[t] - ybase[t]
          
      yy_bar[t] = ybase_bar[t] + s[t]              # without random noise  
      y_trueeff_bar[t] = yy_bar[t] - ybase_bar[t]  # without random noise = s[t] with individual effects
         
      }   # end t
    
      ## working, but now better in time-loop 
      # event_time = rep(NA, T)
        # if( g>1 ) {
        #   event_time = rev( seq(end_year - first_year_treated[g], by=-1, length.out = T) )
        # } 
      
    # first unit -> initialing dat_g
    if (i==1 & g==1) { 
      dat_g <- data.frame(id, year, g_i, yy, first.treated, treat, event_time, ybase, y_trueeff,
                          yy_bar, y_trueeff_bar, now.treated, xx)
    }
    else {
      dati <-  data.frame(id, year, g_i, yy, first.treated, treat, event_time, ybase, y_trueeff,
                          yy_bar, y_trueeff_bar, now.treated, xx)
      dat_g <- rbind(dat_g, dati)
    }
  }    # end i
}   # end g

return(dat_g)

}  # end function

###############################################################################
#### End of simulating data ####
###############################################################################



###############################################################################
#### Calculating the mean (over units) true treatment effects ####
###############################################################################
# Need T and y_no_tr, which is the number of years before treatment, e.g. c(NA  2  4  6  9)
mean_treatment = function(dat_g, y_no_tr, T, graf=1, print=1) {

for (i in 2:gn) {
  true_eff_vec_g = dat_g$y_trueeff[(dat_g$g_i==i)]   # vector of true effects of all units in group g
  true_eff_mat_g = matrix(true_eff_vec_g, nrow = T)  # matrix per unit (over t)
  true_eff_mat_g = true_eff_mat_g[ (y_no_tr[i]+1):nrow(true_eff_mat_g), ]  # cut off non treated years
  true_eff_g = rowMeans(true_eff_mat_g)  # average over units per time 
  if (i==2) {
    true_eff_g_mat_ = true_eff_g
    ll = length(true_eff_g)
  }
  else {
    true_eff_g_mat_ = rbind( true_eff_g_mat_, pad( true_eff_g, ll) )
  }
}

#group meas over all groups
true_eff_g = colMeans(true_eff_g_mat_, na.rm = TRUE) # average over group means


# Alternative calculation of average effects with `aggregate()`
true_eff_g_aggregate = aggregate(dat_g$y_trueeff, list(dat_g$event_time), subset=(dat_g$y_trueeff!=0), FUN=mean)


# true effects (without random effects) from used scenarios 
if (graf==1) {
  
  plot( ggplot(mapping = aes( x = 1:length(s_eff_all[[scenario]]), 
                                     y = s_eff_all[[scenario]] ) ) + 
          geom_point(size=1.5, colour="#999999") + 
          geom_line(linewidth=0.5, colour="#999999") + 
          ylim( min(s_eff_all[[scenario]], 0), max(0.1, s_eff_all[[scenario]]) ) +
          xlab("Periods") +
          theme_bw( ) ) 
  
   plot( ggplot(mapping = aes(x = 1:length(true_eff_g), 
                                     y = true_eff_g) )  +  
          geom_point(size=1.5, colour="#999999") + 
          geom_line(linewidth=0.5, colour="#999999") + 
          xlab("Periods") +
          ylim( min(true_eff_g, 0), max(0.1, true_eff_g) ) + 
          theme_bw( ) ) 
   
}


if ( round(true_eff_g[length(true_eff_g)], 8) != round(true_eff_g_aggregate$x[length(true_eff_g_aggregate$x)],8) )  {
  print("2 alternative calculation of average treatment effects")
  print("  ")
  print(true_eff_g)
  print("  ")
  print(true_eff_g_aggregate$x)
}


return(true_eff_g)
} 
###############################################################################
#### End calculating the mean (over units) true treatment effects ####
###############################################################################



###############################################################################
#### Plotting simulated data ####
###############################################################################
plot_sim_data = function(dat_g, graf1=1, graf2=1) {

  
  if (graf1==1) { 
    plot( ggplot(dat_g, mapping = aes(x = year, y = yy_bar, group = id)) +  
      geom_line( aes(color=id) ) +
      stat_summary(aes(y = yy_bar, group=g_i), fun=mean, colour="red", geom="line", linewidth=2) + 
      theme_bw() )

    plot( ggplot(dat_g, mapping = aes(x = year, y = y_trueeff_bar, group = id)) +  
      geom_line( aes(color=id) ) + 
      stat_summary(aes(y = y_trueeff_bar, group=g_i), fun=mean, colour="red", geom="line", linewidth=2) + 
      theme_bw() ) 
  }

  if (graf2==1) { 
    plot( ggplot(dat_g, mapping = aes(x = year, y = yy, group = id)) +  
      geom_line( aes(color=id) ) + 
      stat_summary(aes(y = yy, group=g_i), fun=mean, colour="red", geom="line", linewidth=2) + 
      theme_bw() )

    plot( ggplot(dat_g, mapping = aes(x = year, y = y_trueeff, group = id)) +  
      geom_line( aes(color=id) ) + 
      stat_summary(aes(y = y_trueeff, group=g_i), fun=mean, colour="red", geom="line", linewidth=2) + 
      theme_bw() )
  }
}
###############################################################################
#### End plotting simulated data ####
###############################################################################

Basic Parameters:

Code

set.seed(4334)

library(did2s)
library(ggplot2)
library(bacondecomp) # Goodman-Bacon

# Groups are defined by first year of treatment, within group always homogeneous but noisy
# 1st group untreated, group

############## Choose basic parameters ########################################
n <- 50 # number of units within each group
gn <- 5  # max 8, n>gn must be true

start_year = 2001  # start year 2001, because t-loop begins with 1
end_year <-  2015  # max 2019 for several scenarios

# Max 8 groups with these treatment starts
# (You can change the treatment starts, group is always 0, untreated)
# (But don't change the order)
first_year_treated = c(0, 2003, 2005, 2007, 2010, 2015, 2016, 2018)

########## The above imply: (don't change these commands)
   start_year_minus_1 <- start_year - 1
   T <- end_year - start_year_minus_1  # number of years
##########

# Variances of unit shocks
sd_x   = 0.03  # sd of x, y and effects
sd_y   = 0.03
sd_eff = 0.03

# trend of ybase -> parallel trends
ybar = ( 0:(T-1) )/50

# mean of mean[t] for xx
beta_x <-  0.5  # dependency of y on control variable x
xbar = ( 0:(T-1) )/50
############## End choose basic parameters ####################################


# Values for all gn groups
first_year_treated = first_year_treated[1:gn]

ever_treated = c(0, rep(1, gn-1))

# number of years of no treatment at the begin
y_no_tr = NA
for (i in 2:gn) {
  y_no_tr = c(y_no_tr, first_year_treated[i] - start_year)
}

Scenarios:

Code

# 7 different scenarios

# for never treated group 1 always 0: 
# seq(0,0,length.out=T)

###############################################################################
# 1) homogeneous constant 
s_1 =  seq(0,0,length.out=T)
for (i in 2:gn) {
  s_1 = rbind(s_1, c(seq(0,0,length.out=y_no_tr[i]), rep(0.08, T-y_no_tr[i]) ) )
}
s1_eff_all = rep(0.1, T-y_no_tr[2]) # for calculating true treatment effect

###############################################################################
## 2) heterogeneous constant
possilbe_values = sample( c(0.1, 0.12, 0.03, 0.05, 0.08, 0.06, 0.02), gn-1 )

s_2 = seq(0,0,length.out=T)
l_eff = T-y_no_tr[2] # length of _eff vectors

s2_eff = seq(0,0,length.out=l_eff)
r=2:gn
for (i in r) {
  s__2 = rep( possilbe_values[i-1], T) 
  #plot(s__2)
  #abline(0,0)
  s2_eff = rbind( s2_eff, pad( s__2[1:(T-y_no_tr[i])],  l_eff) )
  
  s_2 = rbind( s_2, c(seq(0,0,length.out=y_no_tr[i]), s__2[1:(T-y_no_tr[i])] ) )
}

s2_eff_all = colMeans(s2_eff[2:gn,], na.rm = TRUE)
###############################################################################
## 3) homogeneous trend 
# s_3_st is the standard trend also used in other scenarios
s_3_st = seq(0.1, by=0.1, length.out=T)/10

s_3 = seq(0,0,length.out=T)
for (i in 2:gn) {
  s_3 = rbind(s_3, c(seq(0,0,length.out=y_no_tr[i]), s_3_st[1:(T-y_no_tr[i])]))
}
s3_eff_all = s_3_st[1:(T-y_no_tr[2])]
###############################################################################
## 4) heterogeneous trend 
possilbe_values = sample( c(2,  1/1.3,  1.3,  2.2,  1/1.5,  1.5,  1/2), gn-1 )

s_4 = seq(0,0,length.out=T)
l_eff = T-y_no_tr[2] # length of _eff vectors

s4_eff = seq(0,0,length.out=l_eff)
r=2:gn
for (i in r) {
  s__4 = possilbe_values[i-1] * s_3_st 
  #plot(s__4)
  #abline(0,0)
  s4_eff = rbind( s4_eff, pad( s__4[1:(T-y_no_tr[i])],  l_eff) )
  
  s_4 = rbind( s_4, c(seq(0,0,length.out=y_no_tr[i]), s__4[1:(T-y_no_tr[i])] ) )
}

s4_eff_all = colMeans(s4_eff[2:gn,], na.rm = TRUE)
###############################################################################
## 5)  homogeneous dynamics 
s_5_st = c(0.2, 0.5, 0.8, 1.1, 1.3, 1.3, 1.2, 1.1, 0.95, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.56, 0.54, 0.52, 0.50, 0.48, 0.46, 0.44, 0.42, 0.40, 0.38, 0.36, 0.35, 0.34, 0.33, 0.32, 0.31, 0.30)/10

s_5 = seq(0,0,length.out=T)
for (i in 2:gn) {
  s_5 = rbind(s_5, c(seq(0,0,length.out=y_no_tr[i]), s_5_st[1:(T-y_no_tr[i])]))
}

s5_eff_all = s_5_st[1:(T-y_no_tr[2])]
###############################################################################
## 6) complete heterogeneous dynamics 
possilbe_values = sample( c(1.0,  0.0,  -0.2,  -0.5,  0.6,  0.2,  1.3), gn-1 )

s_6 = seq(0,0,length.out=T)
l_eff = T-y_no_tr[2] # length of _eff vectors

s6_eff = seq(0,0,length.out=l_eff)
r=2:gn
for (i in r) {
  s__6 = possilbe_values[i-1] * s_3_st  + (1-possilbe_values[i-1]) * s_5_st[1:length(s_3_st)]
  #plot(s__6)
  #abline(0,0)
  s6_eff = rbind( s6_eff, pad( s__6[1:(T-y_no_tr[i])],  l_eff) )
  
  s_6 = rbind( s_6, c(seq(0,0,length.out=y_no_tr[i]), s__6[1:(T-y_no_tr[i])] ) )
}

s6_eff_all = colMeans(s6_eff[2:gn,], na.rm = TRUE)
###############################################################################
## 7) random walk with drift
possilbe_values = sample( c(0.00015,  0.0002,  0.0003,  0.00006,  0.0001,  0.00025,  0.00035), gn-1 ) # for drift

s_7 =  seq(0,0,length.out=T)
l_eff = T-y_no_tr[2] # length of _eff vectors

s7_eff = seq(0,0,length.out=l_eff)
r=2:gn
for (i in r) {
  rw = cumsum(rnorm(T, 0, 1/150))  # random walk
  dr = sqrt(cumsum( rep( possilbe_values[i-1], T) ) )  # drift
  s__7 = rw + dr
  #plot(s__7)
  #abline(0,0)
  s7_eff = rbind( s7_eff, pad( s__7[1:(T-y_no_tr[i])],  l_eff) )
  
  s_7 = rbind( s_7, c(seq(0,0,length.out=y_no_tr[i]), s__7[1:(T-y_no_tr[i])] ) )
}

s7_eff_all = colMeans(s7_eff[2:gn,], na.rm = TRUE)
###############################################################################

# List of all scenarios
s_all = list(s_1, s_2, s_3, s_4, s_5, s_6, s_7)

# List of all scenarios average effects
s_eff_all = list(s1_eff_all, s2_eff_all, s3_eff_all, s4_eff_all, s5_eff_all,
                  s6_eff_all, s7_eff_all)


for ( i in 2:length(s_all) ) {
  matplot(t(s_all[[i]]), type="l")
}


###############################################################################
## 8) complete heterogeneous  

#  s_all[[sample(1:7,1)]][sample(2:gn,1),]  # works


# Default scenario
scenario = 3

(a) Scenario 2 - heterogeneous constants

###############
scenario = 2 
###############

Figure 5.5

Code

## Simulating data
dat_g = simdata(gn, n, T, xbar, ybar, end_year, first_year_treated, scenario, s_all, beta_x=0.0)


## Calculating true treatment effects
true_eff_g <- mean_treatment(dat_g, y_no_tr=y_no_tr, T=T, scenario, graf=1)


# Plotting data
plot_sim_data(dat_g, graf1=0)

(a) Theoretical scenario specific average treatment effect

Code

```{r}
#| code-fold: TRUE
#| label: fig-Goodman-Bacon1
#| fig-cap: Goodman-Bacon Decomposition
#| fig-height: 2.6
#| fig-width: 5
#| fig-align: center

## Goodman-Bacon Decomposition
average_eff = mean(dat_g$y_trueeff[dat_g$y_trueeff!=0])

fit_tw <- lm(yy ~ now.treated + factor(id) + factor(year), 
             data = dat_g)

static <- did2s(dat_g, yname = "yy", first_stage = ~ 0 | id + year, 
                second_stage = ~ now.treated, treatment = "now.treated", 
                cluster_var = "g_i")

#fixest::etable(static)

print("Estimates of static treatment effect:")
print(paste("Two-way FE estimate =", round(fit_tw$coefficients[2], 4)))
print(paste("Gardner(2021) =", round(static$coeftable[1], 4)))
print(paste("True average effect =", round(average_eff, 4)))
print("  ")
print("  ")

df_bacon <- bacon(yy ~ now.treated,
                   data = dat_g,
                   id_var = "id",
                   time_var = "year", quietly = T)

ggplot(df_bacon) +
  aes(x = weight, y = estimate, color = factor(type)) +
  labs(x = "Weight", y = "Estimate") +
  geom_point(size=3) + 
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()
```

[1] "Estimates of static treatment effect:"
[1] "Two-way FE estimate = 0.0692"
[1] "Gardner(2021) = 0.0841"
[1] "True average effect = 0.0878"
[1] "  "
[1] "  "

Code

## Estimation
out = event_study2(
  data = dat_g, yname = "yy", idname = "id",
  tname = "year", gname = "first.treated", estimator = "all"
)

# xformla = formula(~ xx)

## plotting estimation results

for (i in 1:7) {
  plot( plot_event_study2(out[[i]], horizon = c(-5, 15), truevalue = true_eff_g) )
}

rm(dat_g)

(a) Estimated treatment effects, 90% CI and true effects (red-dashed)

###############
scenario = 6
###############

Figure 5.5

Code

## Simulating data
dat_g = simdata(gn, n, T, xbar, ybar, end_year, first_year_treated, scenario, s_all)


## Calculating true treatment effects
true_eff_g <- mean_treatment(dat_g, y_no_tr=y_no_tr, T=T, scenario, graf=1)


# Plotting data
plot_sim_data(dat_g, graf1=0)

Code

```{r}
#| code-fold: TRUE
#| label: fig-Goodman-Bacon3
#| fig-cap: Goodman-Bacon Decomposition
#| fig-height: 2.6
#| fig-width: 5
#| fig-align: center

## Goodman-Bacon Decomposition
average_eff = mean(dat_g$y_trueeff[dat_g$y_trueeff!=0])

fit_tw <- lm(yy ~ now.treated + factor(id) + factor(year), 
             data = dat_g)

static <- did2s(dat_g, yname = "yy", first_stage = ~ 0 | id + year, 
                second_stage = ~ now.treated, treatment = "now.treated", 
                cluster_var = "g_i")

#fixest::etable(static)

print("Estimates of static treatment effect:")
print(paste("Two-way FE estimate =", round(fit_tw$coefficients[2], 4)))
print(paste("Gardner(2021) =", round(static$coeftable[1], 4)))
print(paste("True average effect =", round(average_eff, 4)))
print("  ")
print("  ")

df_bacon <- bacon(yy ~ now.treated,
                   data = dat_g,
                   id_var = "id",
                   time_var = "year", quietly = T)

ggplot(df_bacon) +
  aes(x = weight, y = estimate, color = factor(type)) +
  labs(x = "Weight", y = "Estimate") +
  geom_point(size=3) + 
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()
```

[1] "Estimates of static treatment effect:"
[1] "Two-way FE estimate = 0.0372"
[1] "Gardner(2021) = 0.0759"
[1] "True average effect = 0.0717"
[1] "  "
[1] "  "

Figure 5.10: Goodman-Bacon Decomposition

Code

## Estimation
out = event_study2(
  data = dat_g, yname = "yy", idname = "id",
  tname = "year", gname = "first.treated", estimator = "all"
)

## plotting estimation results

for (i in 1:7) {
  plot( plot_event_study2(out[[i]], horizon = c(-5, 15), truevalue = true_eff_g) )
}

rm(dat_g)

###############
scenario = 7 
## Parallel trend assumption violated ##
###############

Figure 5.5

Code

## Simulating data
dat_g = simdata(gn, n, T, xbar, ybar, end_year, first_year_treated, scenario, s_all, beta_x = 0.5)


## Calculating true treatment effects
true_eff_g <- mean_treatment(dat_g, y_no_tr=y_no_tr, T=T, scenario, graf=1)


# Plotting data
plot_sim_data(dat_g, graf1=0)

Code

```{r}
#| code-fold: TRUE
#| label: fig-Goodman-Bacon4
#| fig-cap: Goodman-Bacon Decomposition
#| fig-height: 2.6
#| fig-width: 5
#| fig-align: center

## Goodman-Bacon Decomposition
average_eff = mean(dat_g$y_trueeff[dat_g$y_trueeff!=0])

fit_tw <- lm(yy ~ now.treated + factor(id) + factor(year), 
             data = dat_g)

static <- did2s(dat_g, yname = "yy", first_stage = ~ 0 | id + year, 
                second_stage = ~ now.treated, treatment = "now.treated", 
                cluster_var = "g_i")

#fixest::etable(static)

print("Estimates of static treatment effect:")
print(paste("Two-way FE estimate =", round(fit_tw$coefficients[2], 4)))
print(paste("Gardner(2021) =", round(static$coeftable[1], 4)))
print(paste("True average effect =", round(average_eff, 4)))
print("  ")
print("  ")

df_bacon <- bacon(yy ~ now.treated,
                   data = dat_g,
                   id_var = "id",
                   time_var = "year", quietly = T)

ggplot(df_bacon) +
  aes(x = weight, y = estimate, color = factor(type)) +
  labs(x = "Weight", y = "Estimate") +
  geom_point(size=3) + 
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()
```

[1] "Estimates of static treatment effect:"
[1] "Two-way FE estimate = 0.0182"
[1] "Gardner(2021) = -0.0087"
[1] "True average effect = 0.035"
[1] "  "
[1] "  "

Figure 5.13: Goodman-Bacon Decomposition

Code

## Estimation
out = event_study2(
  data = dat_g, yname = "yy", idname = "id",
  tname = "year", gname = "first.treated", estimator = "all"
)

## plotting estimation results

for (i in 1:7) {
  plot( plot_event_study2(out[[i]], horizon = c(-5, 15), truevalue = true_eff_g) )
}

# rm(dat_g)

###############
scenario = 7 
## Parallel trend assumption violated ##
## With control variable xx that explains the different trends ##
###############

Figure 5.5

Code

## Estimation
out = event_study2(
  data = dat_g, yname = "yy", idname = "id",
  tname = "year", gname = "first.treated", estimator = "all", xformla = formula(~ xx)
)

## plotting estimation results

for (i in 1:7) {
  plot( plot_event_study2(out[[i]], horizon = c(-5, 15), truevalue = true_eff_g) )
}

rm(dat_g)

5.6 Chow Test

Returning to a previous example, Table 5.7, about the effect of gender on the received wage, it is very interesting to note that the specification with interactions of female with all other variables is identical to separately estimating two models, one each for men and woman. Below, the results for these two independent estimations are shown in Table 5.13 together with the results from Table 5.7.

These two separate estimates for the subgroups man and woman yield the very same parameter estimates as a common estimation but with all the interaction terms
- In particular, the results for the man-equation are identical to the results with the full set of interaction variables, and the coeficients of the women-equation are identical to the men-results plus the corresponding interaction terms
The F-test previously carried out regarding all interaction effects below Table 5.7 could also be accomplished with the two separately estimated regressions with the test statistic Equation 3.15

F \ = \ \frac{\left(\hat{\mathbf{u}}_{r}^{\prime} \hat{\mathbf{u}}_{r}-\hat{\mathbf{u}}_{ur}' \hat{\mathbf{u}}_{ur} \right) / m}{\hat{\mathbf{u}}'_{ur} \hat{\mathbf{u}}_{ur} /(n-k-1)} \ \sim \ F(m, n-k-1)

In this setup, the unrestricted SSR is given by the sum of the SSRs of the two separate regressions. Furthermore, the number of restrictions is: m = (k + 1), 6 in our case, because the 6 parameters of the two regressions each have to be equal if there is no difference between the subgroups. The number of freely estimated parameters is 12

# estimation of the model only for men with the subset option in lm()
out3 <- lm(lwage ~ educ + exper + I(exper^2) + tenure + I(tenure^2), 
           data = wage1, 
           subset = female==0)

# estimation of the model only for woman with the subset option in lm()
out4 <- lm(lwage ~ educ + exper + I(exper^2) + tenure + I(tenure^2), 
           data = wage1, 
           subset = female==1)

Code

modelsummary( list("Men "=out3, "Woman"=out4, "Both"=out2), 
              output="gt", 
              statistic = "statistic",
              gof_omit = "A|B|L|F", 
              align = "lddd", 
              stars = TRUE, 
              fmt = 3)

Table 5.13: Wage equation for men, woman and both with full set of interactions

	Men	Woman	Both
(Intercept)	0.215	0.323*	0.215+
	(1.629)	(2.308)	(1.659)
educ	0.087***	0.073***	0.087***
	(9.700)	(7.002)	(9.881)
exper	0.040***	0.017*	0.040***
	(5.603)	(2.581)	(5.708)
I(exper^2)	-0.001***	-0.000*	-0.001***
	(-4.864)	(-2.529)	(-4.955)
tenure	0.033***	0.039***	0.033***
	(3.628)	(3.349)	(3.696)
I(tenure^2)	-0.001+	-0.001**	-0.001*
	(-1.945)	(-2.801)	(-1.981)
female			0.108
			(0.561)
female × educ			-0.014
			(-1.034)
female × exper			-0.023*
			(-2.340)
female × I(exper^2)			0.000+
			(1.794)
female × tenure			0.007
			(0.451)
female × I(tenure^2)			-0.001
			(-1.443)
Num.Obs.	274	252	526
R2	0.446	0.260	0.461
RMSE	0.40	0.38	0.39
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Carrying out a F-test with formula from Equation 3.15

```{r}
#| comment: "    "

# Running a single regression with the implicit assumption that there are 
# no differences in the parameters values between man and woman, restricted model

out1 <- lm(log(wage) ~   educ + exper + I(exper^2) + tenure + I(tenure^2), 
           data = wage1)

ssr_r <- sum(out1$residuals^2) # SSR_r of the restricted model
n <- length(out1$residuals) # number of observations



# SSRs for the two separate, unrestricted equations from out3 and out4
ssr3 <- sum(out3$residuals^2);  ssr4<- sum(out4$residuals^2)  

# SSR_ur is the sum of the SSRs of the two separate, unrestricted equations
ssr_ur <- ssr3 + ssr4 



# F test statistic; m = k+1 = 6; in sum we have 2*6 parameters to estimate
F <- ( (ssr_r - ssr_ur)/6 ) / ( (ssr_ur)/(n-2*6) )


# prob-value of F statistic: Plug in the values for F, df1 and df2 into 
# the F-distribution function  
pval <- df(F, 6, n-2*6) 

"Result of F-test, whether all coefficients for man and woman are equal"
paste("F-statistic =", round(F,3), "   p-value =", pval)
```

     [1] "Result of F-test, whether all coefficients for man and woman are equal"
     [1] "F-statistic = 14.997    p-value = 1.8696197671248e-15"

As expected, the F- test statistic is identical to the one where we tested for the exclusion of all variables containing female, see below Table 5.7
This version of the test is called Chow Test. The Chow test is a general test to verify whether subgroups of the sample, (e.g., men/woman, married/unmarried or other characteristics) follow the same population relationship - or the same Data generating process
Originally, the Chow test was intended for detection of structural breaks in time series relationships, e.g. Phillips curves before and after Brexit. With time series, carrying out multiple Chow tests for different periods one can try to detect the point of time of a structural break. This is called Chow breakpoint test
But as we have seen, it is not necessary to estimate separate equations for every subgroup of the sample in carrying out this test
- Rather, it is sufficient to include the appropriate dummy variable, female in our example (or time after Brexit), and the full set of interaction terms with that dummy variable to allow for changes in the intercept, as well as in all slopes

5.7 Categorical data in R

In R, factors are an extension of dummy variables designed for storing categorical information especially, if there are more than two categories
Internally, a factor variable in R consist of a vector of integers with values 1 to k, and a character vector, called levels, which is stored as an attribute of the variable and contains the strings for the labels of the k categories
Using factors in R have several advantages
- When encountering a factor, R knows that this variable contains categorical information and can choose appropriate methods automatically, for instance for plotting and printing
- Furthermore, many packages supports factor variables, e.g. when calculating APEs
- If we have k categories, for instance for different occupations, these k categories are stored in a single factor variable. Using this factor in a regression, R automatically expands this factor into k-1 dummy variables
To define a dummy variable or a character vector, which contain the categories, as a factor variable simply use the function factor( varname )

Example data set

library(AER)
library(wooldridge)
data("CPS1985")
str(CPS1985)

     'data.frame':  534 obs. of  11 variables:
      $ wage      : num  5.1 4.95 6.67 4 7.5 ...
      $ education : num  8 9 12 12 12 13 10 12 16 12 ...
      $ experience: num  21 42 1 4 17 9 27 9 11 9 ...
      $ age       : num  35 57 19 22 35 28 43 27 33 27 ...
      $ ethnicity : Factor w/ 3 levels "cauc","hispanic",..: 2 1 1 1 1 1 1 1 1 1 ...
      $ region    : Factor w/ 2 levels "south","other": 2 2 2 2 2 2 1 2 2 2 ...
      $ gender    : Factor w/ 2 levels "male","female": 2 2 1 1 1 1 1 1 1 1 ...
      $ occupation: Factor w/ 6 levels "worker","technical",..: 1 1 1 1 1 1 1 1 1 1 ...
      $ sector    : Factor w/ 3 levels "manufacturing",..: 1 1 1 3 3 3 3 3 1 3 ...
      $ union     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
      $ married   : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 2 1 ...

Attributes of a factor variable

```{r}
#| comment: "    "

attributes(CPS1985$gender)

levels(CPS1985$occupation)
```

     $levels
     [1] "male"   "female"
     
     $class
     [1] "factor"
     
     [1] "worker"     "technical"  "services"   "office"     "sales"     
     [6] "management"

Factor variables in regressions

out <- lm( log(wage) ~ education + age + gender + married + ethnicity + occupation, 
           data = CPS1985 ) 

modelsummary(list("Regression with factor variables"=out), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")

	Regression with factor variables
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	0.926***	0.144	6.415	<0.001	0.642	1.209
education	0.062***	0.010	6.429	<0.001	0.043	0.081
age	0.011***	0.002	6.267	<0.001	0.007	0.014
genderfemale	-0.228***	0.042	-5.395	<0.001	-0.311	-0.145
marriedyes	0.074+	0.042	1.768	0.078	-0.008	0.156
ethnicityhispanic	-0.121	0.089	-1.362	0.174	-0.294	0.053
ethnicityother	-0.066	0.058	-1.132	0.258	-0.180	0.048
occupationtechnical	0.146*	0.071	2.059	0.040	0.007	0.285
occupationservices	-0.192**	0.062	-3.069	0.002	-0.314	-0.069
occupationoffice	-0.043	0.064	-0.664	0.507	-0.169	0.083
occupationsales	-0.215**	0.082	-2.611	0.009	-0.376	-0.053
occupationmanagement	0.159*	0.076	2.085	0.038	0.009	0.309
Num.Obs.	534
R2	0.326
RMSE	0.43
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Interpretation and redefinitions

In interpreting the coefficients note that the coefficients are relative to the basis category, which is the one left out (by default the first category)
For instance \hat\beta_9, the coefficient of occupationservices = -0.192, means that, everything else equal, a person working for services earns 19.2% less than a person working as worker
If you want to change the basis category, you can do this with the function relevel()
- For instance, relevel(CPS1985$occupation, “management”) makes management to the basis of occupation
- With levels() you can define, name or rename the categories. For instance: levels(varname) \text{ } = \text{ } c(“low”, “high”)
It is also possible to recast a continuous variable into different categories, for instance a income variable into low, middle and high income:
- new_name \text{ } = \text{ } cut(old_name, 3) \text{ } (makes three categories; instead of a number you can also provide a vector with cutpoints)
- levels(new_name) \text{ } = \text{ } c(“low”, “mid”, “high”)

5.8 The linear probability model (LPM)

Until now, we only analysed models with dummy variables on the right hand side – as explanatory variables.

Now, we assume a linear regression with a binary dependent variable, i.e., a dummy variable as left hand variable.

For instance, y may represents labor force participation of married women, which is one if the woman is in the labor force and is zero if not

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u

Under MLR.4, we have as usual:

\Rightarrow \ E(y \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k = \, \mathbf x \boldsymbol \beta

As the dependent variable only takes on the values 0 and 1 we have:

E(y \mid \mathbf x) := \, 1 \cdot P(y=1 \mid \mathbf x) + 0 \cdot P(y=0 \mid \mathbf x)

\Rightarrow \ P(y=1 \mid \mathbf x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \tag{5.5}

This is the linear probability model, LPM: The model estimates the probability that y=1; in our example that a married woman is part of the labor force.

The coefficient \beta_j describes the partial effect of the explanatory variable j on the probability that y=1 \beta_j = \dfrac{\partial P(y=1\mid \mathbf x)}{\partial x_j}

For given \mathbf x, the error term u_i of the LPM can only take two values: u_i = \begin{cases} 1 - P(y=1 \mid \mathbf x), \ \text{ if } y_i = 1 \\ \ \ - P(y=1 \mid \mathbf x), \ \text{ if } y_i = 0 \end{cases} Whereby: P(y=1 \mid \mathbf x) \ = \ \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k \, = \ \mathbf x \boldsymbol \beta
Hence, for given \mathbf x, u_i as well as y_i are bernoulli distributed random variables
Although both possibles values of u_i depend on \mathbf x, the expected value of u_i, conditional on \mathbf x, surprisingly is zero. Hence, MLR.4 is automatically fulfilled, leading to unbiased and consistent estimates of \beta_j (assuming the model is correctly specified, which is generally questionable for the LPM):

E(u_i \mid \mathbf x) \ =

\left (1 - P(y=1 \mid \mathbf x) \right) \cdot P(y=1 \mid \mathbf x) \ + \ \left ( - P(y=1 \mid \mathbf x) \right) \cdot \underbrace{(1 - P(y=1 \mid \mathbf x)}_{P(y=0 \mid \mathbf x)} \ =

P(y=1 \mid \mathbf x) \ - \ P(y=1 \mid \mathbf x)^2 \ - \ P(y=1 \mid \mathbf x) \ + \ P(y=1 \mid \mathbf x)^2 \ = \ 0

However, as we show below, the variance of u_i depends on \mathbf x, hence, we have heteroscedastic errors

Heteroscedasticity of the error term in the LPM

To easy the notation we denote P(y=1\mid\mathbf x) as P_{y=1} and P(y=0\mid\mathbf x) as P_{y=0} in the following derivation.

We have:

\operatorname{Var}(u_i \mid \mathbf x) := \ E \left[ (u_i - E(u_i)\mid \mathbf x))^2 \mid \mathbf x \right] \ =

( \underbrace{- P_{y=1}}_{u_i \text{ if } y_i=0} - \underbrace {E(u_i\mid \mathbf x)}_0 )^2 \cdot P_{y=0} \ \ + \ \ ( \underbrace{1 - P_{y=1}}_{u_i \text{ if } y_i=1} - \underbrace {E(u_i\mid \mathbf x)}_0 )^2 \cdot P_{y=1} \ =

P_{y=1}^2 \cdot (1-P_{y=1}) \ \ + \ \ (1 - P_{y=1})^2 \cdot P_{y=1} \ =

P_{y=1}^2 - P_{y=1}^3 \ \ + \ \ (1 - 2P_{y=1} + P_{y=1}^2) \cdot P_{y=1} \ =

P_{y=1}^2 - P_{y=1}^3 \ \ + \ \ P_{y=1} - 2P_{y=1}^2 + P_{y=1}^3 \ \ = \ \ P_{y=1} - P_{y=1}^2 \ =

P_{y=1} \cdot (1-P_{y=1}) \ =

\mathbf x \boldsymbol \beta \, (1-\mathbf x \boldsymbol \beta)

The variance of u_i is lager if \mathbf x \boldsymbol \beta = P(y=1\mid\mathbf x) is in the “middle”, thus near 0.5. Note, the formula above matches the variance formula for a Bernoulli distributed random variable.

Example – labor force participation of married woman

library(AER); library(effects); library(wooldridge)
data(mroz); str(mroz)

      'data.frame': 753 obs. of  22 variables:
       $ inlf    : int  1 1 1 1 1 1 1 1 1 1 ...
       $ hours   : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
       $ kidslt6 : int  1 0 1 0 1 0 0 0 0 0 ...
       $ kidsge6 : int  0 2 3 3 2 0 2 0 2 2 ...
       $ age     : int  32 30 35 34 31 54 37 54 48 39 ...
       $ educ    : int  12 12 12 12 14 12 16 12 12 12 ...
       $ wage    : num  3.35 1.39 4.55 1.1 4.59 ...
       $ repwage : num  2.65 2.65 4.04 3.25 3.6 ...
       $ hushrs  : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
       $ husage  : int  34 30 40 53 32 57 37 53 52 43 ...
       $ huseduc : int  12 9 12 10 12 11 12 8 4 12 ...
       $ huswage : num  4.03 8.44 3.58 3.54 10 ...
       $ faminc  : num  16310 21800 21040 7300 27300 ...
       $ mtr     : num  0.721 0.661 0.692 0.781 0.622 ...
       $ motheduc: int  12 7 12 7 12 14 14 3 7 7 ...
       $ fatheduc: int  7 7 7 7 14 7 7 3 7 7 ...
       $ unem    : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
       $ city    : int  0 1 0 0 1 1 0 0 0 0 ...
       $ exper   : int  14 5 15 6 7 33 11 35 24 21 ...
       $ nwifeinc: num  10.9 19.5 12 6.8 20.1 ...
       $ lwage   : num  1.2102 0.3285 1.5141 0.0921 1.5243 ...
       $ expersq : int  196 25 225 36 49 1089 121 1225 576 441 ...
       - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

Estimation

We estimate a model to explain the probability of labor force participation of married woman, inlf (for in labor force).

The explanatory variables are:

nwifeinc (additional family income),
educ,
exper,
exper^2,
age,
kidslt6 (number of children under 6 years) and kidsge6 (number of children over 6 years)

# LPM for labor forece participation of married woman
outm <- lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + 
             kidslt6 + kidsge6, 
           data = mroz)

# We calculate squared residuals and fitted values for plotting
ressq <-  outm$residuals^2
yhat <- outm$fitted.values

# Note, we have heteroskedasticity robust standard errors

modelsummary(list("y = P(y=1 | x)"=outm), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             align = "ldddddd",
             vcov = c(vcovHC),
             output = "gt")

Table 5.14: Labor force participation of married woman, HC standard errors

	y = P(y=1 \| x)
	Est.	S.E.	t	p	2.5 %	97.5 %
(Intercept)	0.586***	0.154	3.812	<0.001	0.284	0.887
nwifeinc	-0.003*	0.002	-2.185	0.029	-0.006	-0.000
educ	0.038***	0.007	5.177	<0.001	0.024	0.052
exper	0.039***	0.006	6.600	<0.001	0.028	0.051
I(exper^2)	-0.001**	0.000	-2.997	0.003	-0.001	-0.000
age	-0.016***	0.002	-6.664	<0.001	-0.021	-0.011
kidslt6	-0.262***	0.032	-8.143	<0.001	-0.325	-0.199
kidsge6	0.013	0.014	0.953	0.341	-0.014	0.040
Num.Obs.	753
R2	0.264
R2 Adj.	0.257
AIC	865.8
BIC	907.4
Log.Lik.	-423.892
RMSE	0.42
Std.Errors	Custom
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

The estimated coefficients all have the expected sign. Of special interest is the variable kidslt6, the number of children under 6 years
- The estimated coefficient of -0.26 means that an additional young child reduces the probability of a married woman to be in the labor force by 26%. This is a huge effect! Children over six do not seem to have a notable impact

# Plotting squared residuals against an explanatory variable to 
# detect heteroscedasticity, which is clearly present 

plot(mroz$kidslt6,ressq)

# Residuals versus predicted values; for given x, 
# only two values for the residuals are possible

plot(outm$residuals ~ yhat )
abline(0,0)

Summary

Advantages of the linear probability model
- Easy estimation and interpretation; everything is pretty straightforward, which is not case for the alternative and more elaborate probit or logit models discussed in a later section
- Estimated effects and predictions are often reasonably good in practice
Disadvantages of the linear probability model
- Predicted probabilities may be larger than one or smaller than zero, which is not possible for a probability! In practice, this is often not so a big issue, as typically most of the predictions are in the interval [0, 1]
- The partial effects are constant, so large enough changes in x_j (for instance, 4 children under six years) can lead to changes in probability larger than one, which is not possible. Hence, strictly speaking, the model with constant partial effects cannot be correctly specified. However, very large variations in the explaining variables x_j don’t occur very often in most samples
  - Both problems are tackled with the probit or logit models mentioned above
- As shown above, the linear probability model is necessarily heteroscedastic
  - Hence, heteroscedasticity consistent standard errors need to be computed

5.9 Predictions

Having estimated an equation you can use this relationship to make predictions. However, these predictions are made with estimated parameters and are therefore subject to sampling errors, even if all usual assumptions are met. Hence, it is useful to calculate confidence intervals for predictions, based on the estimated regression line.

The predicted value of a regression model for specific values of \mathbf x_{f} (which typically are not in the sample) is:

\hat E(y_f \mid \mathbf x_f) \ = \ \hat y_f \ = \ \hat \beta_0 + \hat\beta_1 x_{f,1} + \ldots + \hat\beta_k x_{f,k}

Thereby, we simply plug in for \mathbf x_{f} in the estimated regression relationship

The conditional variance of the prediction for a two-variable model with de-meaned variables (so, we need no intercept) is

\operatorname {Var}(\hat y_f \, \mid \, \mathbf X, \mathbf x_f ) \ =

\operatorname {Var}(\hat\beta_1 \, \mid \, \mathbf X, \mathbf x_f) \, x_{f,1}^2 + \operatorname {Var}(\hat\beta_2 \, \mid \, \mathbf X, \mathbf x_f) \, x_{f,2}^2 +

2 \operatorname {Cov}(\hat\beta_1, \hat\beta_2 \, \mid \, \mathbf X, \mathbf x_f ) x_{f,1} x_{f,2}

The variance of the prediction is solely due to the uncertainty regarding the estimated parameters. Hence, this is a weighted sum of the variances and covariance(es) of the estimated coefficients
If we have estimated the variance of the prediction we can construct a 95% confidence interval in the sense that for repeated samples the true predictor E(y_f \mid \mathbf x_f) is within this confidence interval in 95% of the cases. This approximately is: ⁵

E(y_f\mid \mathbf x_f) \ \ \underset {95\%} \in \ \ \left[ \hat y_f \ \pm 2 \ \sqrt{ \widehat{\operatorname {Var}(\hat y_f \mid \mathbf X, \mathbf x_f )} } \right] \tag{5.6}

Note, this confidence interval is solely with regard to variations of \hat {\boldsymbol \beta} in repeated samples, i.e., to the uncertainty of the parameter estimates
Furthermore, if \hat \beta_j is conditionally normal distributed, then the same is true for \hat y_f

If we want to know how accurate the the prediction is we need to know the variance of the prediction error for y_f

The prediction error e_f is defined as

e_f := \ \ (y_f - \hat y_f) \ \ =

\underbrace {\beta_0 + \beta_1 x_{f,1} + \ldots +\beta_k x_{f,k} + u_f}_{y_f} \, - \,\underbrace { (\hat \beta_0 + \hat\beta_1 x_{f,1} + \ldots + \hat\beta_k x_{f,k}}_{\hat y_f} ) \, =

\underbrace{ (\beta_0 - \hat\beta_0) + (\beta_1 - \hat \beta_1) x_{f,1} + \cdots + (\beta_k - \hat \beta_k) x_{f,k} }_{\text{prediction error due to errors in parameter estimates}} \ \, + \underbrace {u_f}_{\text{due to u} }

Hence, the variance of the prediction error is the sum of the two independent components above:

\operatorname {Var}(e_f \mid \mathbf X, \mathbf x_f ) \, = \, \operatorname {Var}(\hat y_f \mid \mathbf X, \mathbf x_f ) + \sigma^2

The first part is the variance of prediction from above and the second part is the variance of the regression error term
Regarding the independence of these two parts, which is necessary for the derivation of the formula above, we have to discuss a subtle point: u_f is not used to estimate \hat\beta_j and therefore the two are statically independent
- The reason is that estimation is carried out with a particular sample and a particular realization of u. The prediction is carried out later, for an arbitrary x_f, and the prediction error realizes with another draw, u_f, from the distribution of u. Hence, \hat\beta_j and u_f are independent, if u is not autocorrelated (MLR.5)

We now can construct a 95% confidence interval in the sense that for repeated samples the actual y_f is within this confidence interval in 95% of the cases. This approximately is:

y_f \ \ \underset {95\%} \in \ \ \left[ \hat y_f \ \pm \ 2 ( \sqrt{ \widehat{\operatorname {Var}(\hat y_f \, \mid \, \mathbf X, \mathbf x_f )} + \hat \sigma^2 }) \right] \tag{5.7}

This second confidence interval is with regard to the actual realizations of y_f and tells us something about the distribution of y-values
The former confidence interval is with regard to the expected value of y_f and tells us something about the uncertainty in the mean. The two differ with regard to the variance of the error term
In R, the following naming is used:
- The former interval, the narrower one (regarding the expected value E(y_f | \mathbf x_f) is denoted by confidence (meaning the confidence in the mean)
- The latter interval, the wider one (regarding the realizations of y_f) is denoted by prediction (meaning interval of the prediction of y_f)
Final remark: If x_f is in the sample, the prdiction error e_f is simply the residual \hat u_f

Example

We, once more, use the model which predicts the test score of students dependent on the attendance rate of a microeconomic lecture and prior test results, priGPA and ATC. This model was stored in the list myres.
Now, we want to predict the outcome for three artificial students, the first two are average students but the first does not attend the micro lecture, and the second one attends the lecture. The third has a very good test record and attends the lecture. In particular we have the following data x_f
Student A: atndrte = 0, priGPA = mean(priGPA), ATC = mean(ATC)
Student B: atndrte = 100, priGPA = mean(priGPA), ATC = mean(ATC)
Student C: atndrte = 100, priGPA = 4, ATC = 30
We have to store these values in a new data frame and call the function predict()
We demand both confidence intervals with the option “interval =”

# Specify values of exogenous variables for which we want a prediction
# the data have to be stored in a data.frame

# Calculating means
mean_priGPA = mean(attend$priGPA)
mean_ACT = mean(attend$ACT)

xf <-  data.frame( atndrte = c(0,100,100) , priGPA = c(mean_priGPA, mean_priGPA, 4), ACT = c(mean_ACT, mean_ACT, 30 ) )

# Prediction and confidence interval for expected values
predict(myres, xf, interval = "confidence")

                fit        lwr        upr
     1 -0.767446712 -1.2091453 -0.3257481
     2  0.006209056 -0.1222101  0.1346282
     3  2.072518373  1.7088323  2.4362044

# Prediction and confidence interval for actual values
predict(myres, xf, interval = "prediction")

                fit        lwr      upr
     1 -0.767446712 -2.5373289 1.002435
     2  0.006209056 -1.7124755 1.724894
     3  2.072518373  0.3204759 3.824561

# With `plot_predictions` from the package `marginaleffects` we can draw the
# confidence interval for expected values (representing parameter uncertainty) 
# in dependence  of a particular variable and the other variables are at their 
# mean values
plot_predictions(myres, condition = "atndrte", points=0.2, rug=TRUE) + theme_bw()

Figure 5.17: Prediction of the mean plus convidence interval, compare Figure 1.6

Predictions with log(y)

If the depended variable is in logs the predicted value of y_i is not simply:

\hat y \ \neq \ \exp ({\widehat {\log y_i}}) \ = \ \exp( \underbrace {\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k}_{\widehat {\log y_i}} )

In particular we have

\log y \ = \ \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \ \ \Rightarrow

y \ = \ \exp (\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) \cdot \exp (u) \ \ \Rightarrow

E(y \mid \mathbf x ) \ = \ \exp (\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) \cdot E( \exp (u) \mid \mathbf x )

If u_i is normally distributed and independent of \mathbf x, than E( \exp (u) \mid \mathbf x ) = \exp (\sigma^2 / 2 ) and a prediction for y_i is

\hat y_i \ = \ \exp({\widehat{\log y_i}}) \exp(\hat\sigma^2 / 2)

Here, \hat\sigma^2 is our standard estimator for \sigma^2
This estimator for y_i is biased but consistent (it is biased, because \sigma^2 is replaced by \hat \sigma^2 inside a non-linear function, compare Wooldridge (2019), p. 206)
- Note that \exp(\hat\sigma^2 / 2) is always > 1, thus \exp ({\widehat {\log y_i}}) would typically underestimate y_i
If the assumption of normally distributed errors is not fulfilled, as is often the case, a MoM estimator can be applied: Replace E( \exp (u) \mid \mathbf x ) with the corresponding sample average to get a consistent estimate of \hat y_i

\hat y_i \ = \ \exp ({\widehat{\log y_i}}) \left( \frac{1}{n} \sum_{i=1}^n \exp (\hat u_i) \right)

Matrix notation

Predictions – Matrix notation

The predicted value(s) for a particular row vector \mathbf x_f is:

\hat { y}_f = \mathbf x_f \hat {\boldsymbol \beta}

With the help of our variance formula from Equation C.7 and the covariance matrix for \hat {\boldsymbol \beta} from Equation C.9 we can calculate the variance of \hat {y}:

\operatorname{Var}(\hat { y}_f) = \sigma^2 \mathbf x_f (\mathbf X' \mathbf X)^{-1} \mathbf x_f ' \tag{5.8}

And the corresponding estimated confidence interval is (compare Equation 5.6)

\hat {y}_f \ \pm \ 2 \sqrt { \hat \sigma^2\mathbf x_f (\mathbf X' \mathbf X)^{-1} \mathbf x_f'} \tag{5.9}

The prediction error is defined by:

e_f := \, y_f - \hat{y}_f

This leads to:

e_f \ = \ \underbrace{{\mathbf x_f\boldsymbol \beta} + u_f }_{y_f} - \underbrace{\mathbf x_f \hat {\boldsymbol \beta}}_{\hat{y}_f} \ =

u_f - \mathbf x_f (\hat {\boldsymbol \beta} - \boldsymbol {\beta} )

Hence, applying the variance formula from Equation C.7 and the covariance matrices for \hat {\boldsymbol \beta} and u and considering the independence of the two terms (u_f is from another realization than the u-s’ which were relevant for estimating \hat{\boldsymbol \beta}) we get the variance of the prediction error:

\operatorname{Var}(e_f) \ = \ \sigma^2 + \sigma^2 \mathbf x_f (\mathbf X' \mathbf X)^{-1} \mathbf x_f ' \ = \ \sigma^2 \left(1 + \mathbf x_f (\mathbf X' \mathbf X)^{-1} \mathbf x_f ' \right) \tag{5.10}

And the estimated confidence interval regarding the prediction error is (compare Equation 5.7)

{\hat {y}}_f \ \pm \ 2 \sqrt {\hat \sigma^2 + \hat \sigma^2 \mathbf x_f (\mathbf X' \mathbf X)^{-1} \mathbf x_f'} \tag{5.11}

Studentized residuals

Studentized residuals - Matrix notation

Studentized residuals (more correctly, internally studentized residuals or standardized residuals) are scaled residuals, divided by their standard deviation, so that their variance equals to one. They also fix the problem that the variance of residual i depends on the distance between \mathbf x_i and its mean (the variance decreases as the corresponding \mathbf x_i gets farther away from its average – regressions are better fitting at the ends of the domain)
As this involves the calculation of the variance of the residuals \hat{\mathbf u} := \, \mathbf y - \hat{\mathbf y}, the problem is similar to the one of calculating the prediction error
We are looking for the variance of \hat{\mathbf u} (considering Equation C.4 and Equation C.11)

\hat{\mathbf u} \, = \, \underbrace{{\mathbf X\boldsymbol \beta} + \mathbf u }_{\mathbf y} - \underbrace{\mathbf X \hat {\boldsymbol \beta}}_{\hat{\mathbf y}} \ =

\mathbf u - \mathbf X (\hat {\boldsymbol \beta} - \boldsymbol {\beta} ) \ = \ \mathbf u - \mathbf X \underbrace {(\mathbf X' \mathbf X)^{-1} \mathbf X' \mathbf u}_{(\hat {\boldsymbol \beta} - \boldsymbol {\beta} )} \ =

\underbrace {(\mathbf I - \mathbf P_x)}_{\mathbf M_x} \, \mathbf u

It is important to note, that in this context the error term \mathbf u is obviously correlated with \hat {\boldsymbol \beta}. Furthermore, remember that \mathbf M_x as well as \mathbf P_x are idempotent and symmetric. The covariance matrix of \mathbf u therefore is:

\operatorname{Var}(\hat{\mathbf u}) \ = \ (\mathbf I - \mathbf P_x) \ \sigma^2 \mathbf I\ (\mathbf I - \mathbf P_x)' \ =

\sigma^2 (\mathbf I - \mathbf P_x)

Sometimes, the projection matrix \mathbf P_x is denoted as the hat matrix \mathbf H
The i^{th} diagonal element of this covariance matrix represents the variance of the residual \hat u_i:

\operatorname{Var}(\hat u_i) \ = \ \sigma^2 \left(1 - \mathbf x_i (\mathbf X' \mathbf X)^{-1} \mathbf x_i' \right)

This formula for the variance is very similar to the one for the prediction error from above. They only differ in a sign. This difference is due to a different stochastic context: In the case of prediction errors u and \hat \beta are independent (estimation and predictions are done with different random draws of u), whereas in the context of calculating residuals, \hat \beta obviously depends on u
\mathbf x_i (\mathbf X' \mathbf X)^{-1} \mathbf x_i', the i^{th} diagonal element of \mathbf H respectively \mathbf P_x, is often denoted as h_{ii}. Hence, we can write for the estimated standard deviation of \hat u_i

\operatorname{se}(\hat u_i) \ = \ \hat \sigma \sqrt{1 - h_{ii} } \tag{5.12}

We divide each regression residual by the corresponding estimate of its standard deviation to get the studentized (standardized) residuals

t_i := \ \dfrac {\hat u_i}{\hat \sigma \sqrt{1 - h_{ii} }} \tag{5.13}

By construction, \operatorname{Var}(t_i)=1
- The formula for externally studentized residuals somewhat differs; \hat y_i is replaced by \hat y_{(i)} which denotes the predicted values of y_i based on the estimated model with the i^{th} observation deleted. The residual \hat u_{(i)} is then defined by (y_i - \hat y_{(i)}). \hat\sigma is replaced by \hat\sigma_{(i)}, which is calculated accordingly. The externally studentized residuals are usually used for an outlier analysis
h_{ii} is the so called leverage of observation i. The leverage depends on the distance of \mathbf x_i to the mean of \mathbf x_i and is a measure of the potential influence of observation i on the predicted value \hat y_i: \, h_{i i} = \frac{\partial \hat{y}_{i}}{\partial y_{i}}
External studentized residuals plays some role in finding outliers (rule of thumb: t_i > 3) or structural breaks
There is a “Trick” to compute (externally) studentized residuals “by hand” for each observation:
1. Define a dummy variable for each observation i
2. Estimate n regressions, each including another dummy variable
3. The coefficients of the dummies corresponds to the particular residuals
4. The t-statistic associated with the dummy is the studentized residual (coef / sd of coef)

In this case, the APE is equal to the Partial Effect at Average (PEA), but this is not always the case.↩︎
See Card and Krueger (1994) for a famous and controversial investigation of the effects of minimum wages on employment using a DiD estimator. They found a positive effect of rising minimum wages on employment with their DiD estimators, which is not easy to explain and suggest the assumption of an involved bias. For a critical comment on this work see Neumark and Wascher (2000). For a more modern approach with a different result see Callaway and Sant’Anna (2021).↩︎
In a panel setting (which is not the case here – we have a pooled cross-section) with units i and two periods we simply could regress:
\Delta lprice_i = \beta_0 + \beta_1 nearinc_i + u_i, or
lprice_{it} = \beta_1 nearinc_{it} + a_i + d_{t} + u_{it},
with nearinc_{it} is 0 before the incinerator was build and 1 after that event for those units nearby, a_i being unit-specific intercepts and d_t time-specific intercepts (both resulting from the inclusion of corresponding dummy variables, see Chapter 9).
Actually, if the same units are observed in the two (or more) periods, these specifications are most often used in applied work.↩︎
As usual, both variables have to be included in the equation for a correct interpretation.↩︎
With the true predictor E(y_f \mid \mathbf x_f) we mean the corresponding point on the population regression function, Figure 1.2.↩︎

5.1 Topics of this chapter

5.2 Interaction terms

A reparametrization

Average partial effects

5.2.1 Interaction terms - Example

APE

Reparametrization

Diagrams of non-linear effects

5.3 Dummy variables

Example data

Example - Wage equation

Interpretation of results of this example

Boxplot

Interpretation of results

5.4 Interactions with dummy variables

Interpretation of the coefficients

Test, whether all variables which contain female are jointly insignificant

The APEs for the variables

5.5 Treatment effects and the DiD estimator

Example – house prices

Result of the example above:

How does this work?

DiD estimator as interaction term

Let’s estimate the model with an interaction term as outlined above:

Simulation study

5.6 Chow Test

Carrying out a F-test with formula from Equation 3.15

5.7 Categorical data in R

Example data set

Attributes of a factor variable

Factor variables in regressions

Interpretation and redefinitions

5.8 The linear probability model (LPM)

Example – labor force participation of married woman

Estimation

Summary

5.9 Predictions

Example

Predictions with log(y)

Matrix notation

Studentized residuals

Test, whether all variables which contain `female` are jointly insignificant